Chapter 1


INTRODUCTION 


This  report  summarizes  the  work  we  have  performed  for  the  project  entitled  “Modeling 
Abstraction  and  Simulation  Techniques.”  The  objective  of  this  effort  has  been  to  develop 
and  study  three  novel  complementary  directions  that  may  be  summarized  as  follows: 

1. Extract additional information from the inherently slow simulation process of complex
systems by exploiting new concurrent simulation techniques.

2. Exploit the hierarchical structure in multi-resolution simulation models by decomposing
them in ways which preserve statistical fidelity.

3.  Explore  the  use  of  neural  networks  as  complex  simulation  metamodels. 

The scope of the project has been to develop specific methodologies and algorithms and
test them on benchmark problems in C4I application areas. Thus, appropriate simulation
models  were  built,  and  algorithms  based  on  the  proposed  new  techniques  were  developed 
and  tested.  In  many  cases,  the  benchmark  problems  studied  are  the  same  or  extensions 
of  the  ones  developed  during  our  previous  projects  “Enabling  Technologies  for  Real-Time 
Simulation”  [12]  and  “Real-Time  Simulation  Technologies  for  Complex  Systems”  [13]. 

We  begin  by  briefly  outlining  some  of  the  major  challenges  faced  by  modeling  and 
simulation  techniques  for  complex  systems  and  the  approaches  we  are  following  to  address 
these  challenges  (Section  1.1).  We  then  describe  the  organization  of  this  report  (Section 
1.2). 

1.1  Issues  in  Modeling  and  Simulation  of  Complex  Systems 


Simulation  is  widely  recognized  as  one  of  the  most  versatile  and  general-purpose  tools 
available  today  for  modeling  complex  processes  and  systems  and  for  solving  problems  in 
design, performance evaluation, decision making, and planning. In the C4I environment, in
particular, most situations that confront analysts and decision makers are of such complexity
that  handling  them  far  surpasses  the  scope  of  available  analytical  and  numerical  methods; 
this  leaves  simulation  as  the  only  alternative  of  “universal”  applicability.  Unfortunately, 
there  are  several  factors  that  limit  the  use  of  simulation. 

1.  For  most  situations  of  practical  interest,  simulation  is  extremely  time-consuming. 

2.  In  order  to  evaluate  different  alternatives,  one  has  to  perform  a  large  number  of 
simulations  (one  for  each  alternative).  Furthermore,  combining  and  processing  the 
resulting  data  in  a  way  which  enhances  decision  making  capability  is  a  difficult  task. 


Since  simulation  is  so  time  consuming,  it  is  usually  viewed  as  an  off-line  tool:  one  has  to 
wait  for  the  completion  of  one  or  more  simulation  runs  before  deciding  how  to  interpret  the 
results  and  how  to  proceed  next.  Our  objective  in  this  project  is  to  transform  simulation 
into  a  much  more  interactive  tool  not  only  for  “evaluation”  of  alternatives,  but  also  for 
efficient  real-time  “optimization”  over  alternatives.  It  is  also  desirable  to  utilize  simulation 
as  a  means  towards  obtaining  much  simpler,  yet  accurate,  surrogate  models  of  the  complex 
process  or  system  of  interest;  this  is  also  referred  to  as  metamodeling. 

To achieve the objectives outlined above, during this project we have pursued the following
complementary directions:


1.  Model  abstraction  using  fluid  simulation:  As  already  mentioned,  Discrete-Event 
Simulation  is  time  consuming  and  impractical  due  to  the  large  number  of  events 
that  are  usually  involved.  An  alternative  abstract  modeling  paradigm  is  based  on 
Fluid  Models  (FM).  The  fluid-flow  worldview  can  provide  either  approximations  to 
complex  discrete-event  models  or  primary  models  in  their  own  right.  Furthermore, 
fluid models can be combined with discrete-event models to develop a class of hybrid
systems, where the state of the system is described by discrete as well as continuous
variables and the system dynamics are both time-driven and event-driven. Such
hybrid  models  can  be  used  to  model  a  fairly  broad  class  of  systems  including  battle 
engagements,  communication  networks,  manufacturing  systems  and  many  more. 

The  justification  of  FM  rests  on  the  realization  that  some  events  are  more  important 
than  others.  In  effect,  fluid  models  aggregate  several  events  into  a  single  event  making 
simulation  significantly  more  efficient.  For  example,  in  the  context  of  high  speed 
communication  networks,  the  effect  of  an  individual  packet  or  cell  on  the  entire  traffic 
process  is  virtually  infinitesimal,  not  unlike  the  effect  of  a  water  molecule  on  the  water 
flow  in  a  river.  To  appreciate  the  effectiveness  of  FM,  consider  for  example  a  discrete 
event  simulation  run  of  an  ATM  link  operating  at  622  Megabits-per-second  which 
requires  the  processing  of  over  a  million  events  per  second.  On  the  other  hand,  if 
traffic  comes  from  the  source  at  rates  that  are  piecewise-constant  functions  of  time, 
then  a  simulation  run  would  process  only  one  event  per  rate  change.  Thus,  30  rate 
changes  per  second  (as  in  certain  video  encoders)  may  require  the  processing  of  only 
30  events  per  second. 

When using abstraction techniques (like fluid models) it is important to determine
the right resolution (level of abstraction), i.e., the number of events that are aggregated
and treated as a single event. High resolution models (detailed simulation) are
impractical.  On  the  other  hand,  very  low  resolution  implies  significant  approximation 
errors.  In  this  report,  we  analyze  the  tradeoff  between  the  fidelity  of  the  simulation 
results  and  the  resolution  level  of  fluid  simulation. 

2.  Control  and  optimization  using  fluid  simulation:  In  a  complementary  direction, 
even  in  cases  where  the  accuracy  of  a  fluid  model  is  not  very  high,  the  model  might 
still  be  usable  for  the  purpose  of  control  and  optimization  rather  than  performance 
analysis.  In  this  case,  it  is  not  unreasonable  to  expect  that  one  can  identify  the  solution 
of  an  optimization  problem  based  on  a  model  which  captures  only  those  features  of 
the  underlying  “real”  system  that  are  needed  to  lead  to  the  right  solution,  but  not 
necessarily  estimate  the  corresponding  optimal  performance  with  accuracy.  Even  if 
the  exact  solution  cannot  be  obtained  by  such  “lower-resolution”  models,  one  can  still 
obtain  near-optimal  points  that  exhibit  robustness  properties  with  respect  to  certain 
aspects  of  the  model  they  are  based  on.  Such  observations  have  been  made  in  several 
contexts  (e.g.,  [63,  60,  17]). 

3. Concurrent simulation: In simulation studies it is generally required to evaluate
the performance $J(\cdot)$ of the system under a set of parameters/scenarios $\{\theta_0, \ldots, \theta_N\}$
(where each $\theta_i$ is generally a vector quantity). The typical solution approach is to
repeatedly simulate the system under each parameter/scenario, which requires at least
$N + 1$ simulation runs. If a typical simulation requires $T$ time units, then this process
requires a total of $(N+1)T$ time units. In the context of concurrent simulation it is desired
to extract additional information from a single simulation run under a parameter
$\theta_0$ besides the measure $J(\theta_0)$. One possibility is to obtain $\{J(\theta_0), \ldots, J(\theta_N)\}$. In this
case, it is also required that $(N+1)T \gg T + c$, where $c$ is the required computational
overhead. Alternatively, one can obtain gradient information $\nabla J(\theta) = [\partial J/\partial\theta_1, \ldots, \partial J/\partial\theta_n]$,
which can be used together with stochastic optimization schemes [17] for optimization
or to expedite the training of a neural network metamodel (see Chapter 4).

4.  Model  abstraction  using  neural  networks:  In  the  context  of  simulation,  the  main 
idea  of  metamodeling  is  to  build  a  “surrogate”  model  of  the  system  of  interest  which  is 
much  simpler  (yet  accurate)  to  work  with.  This  is  essentially  analogous  to  constructing 
a function $F(x)$, $x = [x_1, \ldots, x_n]$, from only a finite set of selected samples $x^1, \ldots, x^M$.
The problem, of course, is that the actual function we are trying to approximate with
$F(x)$ is unknown. There are several approaches to metamodeling, many of which are
domain-specific  (e.g.,  see  [2,  29,  30]).  The  most  common  general  approach  is  to  try 
and  build  a  polynomial  expression.  This  is  often  inadequate  because  if  the  shape  of  the 
actual  curve  corresponding  to  F(x)  includes  sudden  jumps  and  asymptotic  behavior 
(which  is  very  often  the  case  from  experience) ,  then  polynomial  fits  to  such  curves  are 
known  to  be  poor. 

In our work we address what we view as two key challenges in the domain of metamodeling:
(a) Maximizing the amount of information extracted from simulation so
as to enhance the accuracy of the metamodel constructed and (b) Obtaining a metamodeling
device of “universal” applicability, i.e., one capable of generating functions
of virtually arbitrary complexity. The first challenge is addressed through the concurrent
simulation approach discussed earlier. The second involves the use of neural
networks  as  surrogate  models.  Neural  networks  have  been  used  successfully  in  many 
areas  of  application  (e.g.,  speech  and  pattern  recognition)  and  have  generated  a  great 
deal  of  enthusiasm  for  the  promise  they  bring  (e.g.,  see  [27,  83,  53]).  The  main  idea 
is  to  view  a  neural  network  as  a  device  that  acts  as  a  “metamodel”  and  provides  a 
desired  response  curve  in  great  generality;  that  is  precisely  its  strength.  We  have  used 
several  benchmark  problems  to  date  to  compare  neural  networks  to  state-of-the-art 
alternatives  and  have  obtained  extremely  positive  results  [12,  39,  14,  66].  Here  we 
point  out  that  the  form  of  the  additional  information  extracted  through  concurrent 
simulation  is  also  important.  For  example,  it  is  not  immediately  obvious  how  gradient 
information can be used in the metamodel development. This is investigated in [66]
and  Chapter  4  of  this  report. 

5.  Hierarchical  simulation  and  statistical  fidelity:  One  way  to  reduce  complexity 
is  through  hierarchical  decomposition  of  a  simulation  model.  The  challenge  here  is 
to  do  it  without  sacrificing  accuracy.  By  “accuracy”  we  mean  that  the  statistical 
information  generated  at  the  low-level,  high-resolution  simulation  model  should  be 
preserved accurately at the higher-level models. In focusing on the preservation of
the  stochastic  fidelity  in  hierarchical  battle  simulation  models  we  have  worked  with 
a  concrete  model  [12,  13]  and  analyzed  various  approaches  for  the  preservation  of 
stochastic  fidelity.  Our  effort  has  been  directed  at  developing  an  interface  between 
the  two  simulation  levels  to  preserve  the  statistics  to  the  maximum  extent  that  the 
available  computing  power  allows.  In  our  previous  project  [12]  we  initiated  a  study  of 
an  approach  based  on  clustering  or  path  bundle  grouping  which  is  further  pursued  in 
this  project. 
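To put the event-count argument of direction 1 in numbers, the short calculation below compares a packet-level and a fluid simulation of the 622 Mbit/s ATM link. It is only a back-of-the-envelope sketch: the 53-byte cell size is the standard ATM cell length and is an assumption not stated in the text, the link is taken to be fully loaded, and the function names are illustrative.

```python
LINK_RATE_BPS = 622e6        # link speed quoted in the text
CELL_BITS = 53 * 8           # standard ATM cell of 53 bytes (assumption)
RATE_CHANGES_PER_SEC = 30    # e.g. one rate change per video frame


def packet_level_events_per_sec(link_rate_bps, cell_bits):
    """Cells per second on a fully loaded link; each cell needs at least one event."""
    return link_rate_bps / cell_bits


def fluid_events_per_sec(rate_changes_per_sec):
    """A fluid simulator processes one event per rate change."""
    return rate_changes_per_sec


cells = packet_level_events_per_sec(LINK_RATE_BPS, CELL_BITS)
fluid = fluid_events_per_sec(RATE_CHANGES_PER_SEC)
print(f"packet level: {cells:,.0f} events/s, fluid: {fluid} events/s, "
      f"ratio about {cells / fluid:,.0f}")
```

The cell rate alone exceeds one million per second, consistent with the figure quoted in direction 1, while the fluid model handles only the 30 rate changes.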


1.2  Organization 


The  content  of  this  report  is  organized  as  follows. 

Chapter  2:  First  we  review  the  basic  fluid  simulation  (FS)  modeling  framework  and  its 
variants:  time  stepped  simulation  (TSS),  time  driven  fluid  simulation  (TDFS),  and 
time  stepped  hybrid  simulation  (TSHS).  Subsequently  we  investigate  the  “accuracy” 
of  the  fluid  simulation  models  mainly  by  studying  the  errors  generated  by  ignoring 
the  detailed  dynamics  at  small  time  scales.  Through  multiple  simulation  studies  we 
demonstrate  how  the  resolution  affects  the  error  of  the  abstract  fluid  model. 

Chapter  3:  We  present  the  general  framework  of  concurrent  simulation.  Subsequently, 
we  adopt  the  stochastic  fluid  modeling  (SFM)  framework  (a  simple  variant  of  the  FS 
framework  presented  above)  and  based  on  it  we  derive  sample  derivative  estimates  of 
the  performance  measures  of  interest.  Furthermore,  these  derivatives  are  very  easy 
to  evaluate  and  we  show  that  they  are  unbiased  and  nonparametric,  i.e.,  they  do  not 
depend on the stochastic processes that drive the system dynamics. Finally, we show
that  these  derivatives  can  be  evaluated  directly  from  the  sample  path  of  the  discrete 
event  system.  In  other  words,  it  is  not  necessary  that  we  revert  to  a  fluid  simulation 
model,  rather  we  can  extract  the  sensitivity  information  directly  from  the  discrete 
event  simulator. 

Chapter  4:  We  review  the  main  concepts  involved  in  using  neural  networks  as  universal 
function  approximators.  Training  neural  networks  typically  requires  a  large  number 
of  training  points  and  since  these  points  are  obtained  from  simulation,  the  training 
process  is  very  time-consuming.  To  speed  up  the  data  collection  one  can  use  concurrent 
simulation.  It  is  straightforward  to  use  additional  information  that  is  in  the  form 
of  the  system’s  output  under  different  input  parameters.  On  the  other  hand,  it  is 
not  obvious  how  sensitivity  information  generated  by  concurrent  simulation  can  be 
used.  We  investigate  the  use  of  sensitivity  information  to  reduce  the  simulation  effort 
required  for  training  a  NN  metamodel. 

Chapter 5: In this chapter, we discuss the applications of clustering methods in hierarchical
simulation of complex systems and in system modelling. In the first part,
we discuss the basic concepts for multi-resolution simulation modelling of complex
stochastic systems. We argue that, to maintain statistical fidelity, high-resolution output
data should be classified into groups that match underlying patterns or features of the
system behavior before group averages are sent to the low-resolution modules.
We propose high-dimensional data clustering as a key interfacing component
between simulation modules with different resolutions and use unsupervised learning
schemes to recover the patterns in the high-resolution simulation results. In the
second part, we give examples of using a Hidden Markov Model as an effective
clustering tool for this task. Subsequently, we apply the clustering approach to a
computer security problem (intrusion detection) and give examples of using Hidden
Markov Models for the purpose of system modelling for anomaly detection.

Chapter  6:  In  this  chapter  we  investigate  the  use  of  sensitivity  information  (obtained 
through  concurrent  estimation)  together  with  stochastic  approximation  schemes  for 
“real-time” optimization purposes. Our examples are derived from the areas of computer
networks and mission planning in the context of Joint Air Operations (JAO).

Chapter  7:  We  present  the  main  conclusions  of  our  study,  including  lessons  learned  and 
recommendations. We also outline our ongoing work and some future research directions.
Chapter 2


MODEL  ABSTRACTION  USING 
FLUID  MODELS 


Modelling and performance evaluation are crucial to the design, development and management
of  complex  systems  (e.g.,  computer  networks).  Conventional  analytical  methods  usually  rely 
on  overly  simplified  assumptions  while  discrete  event  simulation  (DES)  is  computationally 
prohibitive  (requires  long  simulation  runs).  Thus  the  scalability  of  evaluation  tools  has  been 
the  focus  of  many  studies  and  is  also  the  topic  of  this  chapter  where  we  are  motivated  by 
problems  in  traffic  engineering  [5]  as  they  apply  in  computer  networks. 


2.1  Introduction 


Several directions can be followed to improve model scalability. Parallel DES tools, such as
SSF [74] and PDns (parallel/distributed ns) [65], take advantage of the computational
power of multiprocessors or distributed computer networks. Another way is to raise the
abstraction  level  of  modelling  and  simulation.  For  example,  a  fluid  model  only  captures 
traffic  burstiness  at  a  large  timescale.  The  resulting  fluid  simulation  (FS)  [47]  tracks  the 
fluid  rate  changes  caused  by  sources  and  multiplexing  at  various  nodes  in  the  network. 
As mentioned in the previous chapter, since the frequency of rate changes is typically
much lower than the packet transmission rate, FS is expected to achieve significant simulation
speedup. However, due to the well-known ripple effect [55], FS may become more expensive
than DES when evaluating large complex networks. Thus researchers use hybrid methods
to achieve fast simulation. For example, the Opnet simulator [46] combines analytical
techniques and DES.

In  this  chapter  we  study  time-stepped-simulation  (TSS),  an  abstract  simulation  scheme, 
where  the  time  axis  is  discretized  into  small  fixed  intervals  called  time  steps.  The  stochastic 
behavior  of  traffic  within  the  time  step  is  ignored  and  constant  arrival  rates  are  assumed. 
The  simulation  proceeds  in  a  time-driven  fashion,  updating  the  system  state  periodically. 
Figure 2.1: TSS and PLS (packet-level simulation) in a queue ($h$ denotes the time-step length).


Fig.  2.1  illustrates  how  TSS  and  Packet-level-simulation  (PLS)  view  source  traffic  and  the 
queueing  evolution  in  the  buffer. 

TSS trades off accuracy for speedup, eliminates the FS ripple effect and is capable
of adjusting simulation granularity to various levels. Moreover, TSS can facilitate parallel
simulation due to its synchronous nature. Time driven fluid simulation (TDFS) [85]
and  time  stepped  hybrid  simulation  (TSHS)  [38]  also  belong  to  this  category.  In  TDFS  the 
fluid  rate  during  each  interval  is  the  average  rate  over  that  interval  and  queueing  backlogs 
are  tracked  in  discrete  time.  In  order  to  simulate  packet-based  protocols  such  as  TCP,  Guo 
[38]  proposes  TSHS,  where  packets  from  the  same  session  are  assumed  to  be  evenly  spaced 
within  the  time-step. 

Accuracy is one of the major issues of abstract simulation, where errors are mainly due to
the following.

1.  No  packet  inter-arrival  variation  within  each  time-step  or  flow  state. 

2.  No  discrete  workload  in  fluid-type  models. 

3.  Flow  interference  in  multiplexors. 

4.  The  procedure  of  extracting  packet  statistics  (for  example,  inferring  packet  end-to-end 
delays  from  FS). 

Yan  [85]  and  Guo  [38]  give  error  bounds  of  TSS.  This  chapter  focuses  on  simulation  accuracy 
and  mainly  studies  the  error  from  ignoring  traffic  randomness  at  small  timescales. 


2.2  Impact  of  Autocorrelation 

2.2.1  TSS  for  an  M/D/1  Queue 

Consider  the  simplest  queueing  system  M/D/1.  Assume  that  the  service  time  of  one  packet 
is $d$. If TSS chooses abstraction level $h$, the simulation proceeds in time intervals of length
$h \times d$. Let $x_n$ be the number of packets in the system at the beginning of time step $n$ and
$a_n$ the number of arrivals during this interval. We have the following system equation

$$x_{n+1} = [x_n + a_n - h]^+ \tag{2.1}$$

where $[x]^+ = \max\{0, x\}$. We introduce an auxiliary random variable $y_n$

$$y_{n+1} = [y_n - h]^+ + a_n. \tag{2.2}$$

Assuming that the system is stable, let $x$, $y$ and $a$ denote the stationary values of $x_n$, $y_n$ and
$a_n$. We have

$$E x = E y - E a. \tag{2.3}$$
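The recursion for $x_n$ can also be simulated directly, which gives a quick empirical estimate of $Ex$ for any abstraction level. The following is a minimal sketch, not the authors' simulator: it assumes Poisson arrivals with mean $\rho h$ per time step (so that $\rho$ is the utilization), and the function names `poisson_count` and `tss_md1_mean_backlog` are illustrative only.

```python
import random


def poisson_count(mean, rng):
    """Number of points of a unit-rate Poisson process in [0, mean)."""
    count = 0
    t = rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def tss_md1_mean_backlog(rho, h, steps, seed=1):
    """Time average of x_n under x_{n+1} = max(0, x_n + a_n - h).

    Assumption: a_n is Poisson with mean rho * h (h service times per step).
    """
    rng = random.Random(seed)
    x = 0
    total = 0
    for _ in range(steps):
        total += x                      # x_n is sampled at the start of step n
        a = poisson_count(rho * h, rng)
        x = max(0, x + a - h)           # at most h packets are served per step
    return total / steps


if __name__ == "__main__":
    for h in (1, 5, 10):
        print(h, round(tss_md1_mean_backlog(0.8, h, 50_000), 4))
```

Because the estimate is a sample average, it fluctuates with the seed and the number of steps; longer runs are needed as the utilization approaches 1.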

Appendix A gives the derivation of Equation (2.3). Let $Y(z)$ and $A(z)$ be the probability
generating functions (PGF) of $y$ and $a$ respectively. Equation (2.2) is the evolution equation
of a discrete-time multi-server single queue denoted by G/D/h [8]. For TSS of M/D/1, we
get

$$E y = Y'(z)\big|_{z=1} = \frac{1}{2(1-\rho)} - \frac{h}{2}(1-\rho) + \sum_{i=1}^{h-1}(1 - z_i)^{-1}, \tag{2.4}$$

where $\rho$ is the utilization of the system and the $z_i$'s are the $(h-1)$ roots of (2.5) with $|z_i| < 1$.

$$z^h e^{\rho h (1 - z)} - 1 = 0 \tag{2.5}$$

Combining (2.3) and (2.4), and noting that $Ea = \rho h$, we obtain the average number of
packets in the system for abstraction level $h$.


$$E x = \frac{1}{2(1-\rho)} - \frac{h}{2}(1+\rho) + \sum_{i=1}^{h-1}(1 - z_i)^{-1}. \tag{2.6}$$

Let $Eq$ denote the theoretical mean number of packets in an M/D/1 system.

$$E q = \frac{1}{2(1-\rho)} - \frac{1-\rho}{2}. \tag{2.7}$$

Then, the TSS absolute abstraction error $\Delta q(h, \rho)$ for abstraction level $h$ and utilization
$\rho$ is given by:

$$\Delta q(h, \rho) = Eq - Ex = \frac{h-1}{2} + \frac{h+1}{2}\rho - \sum_{i=1}^{h-1}(1 - z_i)^{-1}. \tag{2.8}$$

Finally, we also define the relative error $l(h, \rho)$ as

$$l(h, \rho) = \frac{\Delta q(h, \rho)}{Eq} = \frac{\frac{h-1}{2} + \frac{h+1}{2}\rho - \sum_{i=1}^{h-1}(1 - z_i)^{-1}}{\frac{1}{2(1-\rho)} - \frac{1-\rho}{2}}, \tag{2.9}$$

where all $z_i$'s can be obtained using numerical techniques. The simulation errors are shown
in Fig. 2.2 and we make the following observations.


Figure 2.2: Absolute and relative simulation errors in mean system time for M/D/1.


Table 2.1: Relative simulation errors for M/D/1 (h = 5).

ρ      0.60    0.65    0.75    0.80    0.85    0.90
Err    0.8112  0.7544  0.6089  0.5182  0.4138  0.2939


1.  The  simulation  accuracy  is  strongly  related  to  the  system  utilization.  The  absolute 
error  increases  while  the  relative  error  decreases  as  utilization  increases  (see  Fig.  2.2). 
As  utilization  approaches  1,  the  relative  error  reduces  to  zero  which  implies  that 
low-resolution  simulation  works  better  for  a  heavily  loaded  system. 

2. Low resolution is inaccurate, but as utilization approaches the extremes (very low or
very high) the difference in relative error among different resolutions is small.

3. The utilization of practical networks often stays in the range [0.6, 0.9]. Table 2.1
gives the relative errors for this range. The simulation errors are large when the abstraction
level is $h = 5$. In the following, we will show that correlations in the source
traffic make TSS work better.
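The closed-form error expressions (2.5)-(2.9) are straightforward to evaluate numerically. The sketch below is one possible way to do it and is not taken from the report: it assumes the M/D/1 mean number in system $Eq = \rho + \rho^2/(2(1-\rho))$ and $Ex = 1/(2(1-\rho)) - h(1+\rho)/2 + \sum_i (1-z_i)^{-1}$, and it finds the roots of (2.5) inside the unit disk by fixed-point iteration on $z = w_k e^{\rho(z-1)}$, where $w_k$ is an $h$-th root of unity. All function names are hypothetical.

```python
import cmath
import math


def tss_roots(rho, h, iterations=200):
    """Roots of z**h * exp(rho*h*(1 - z)) = 1 strictly inside the unit disk.

    Taking h-th roots, each solution satisfies z = w_k * exp(rho*(z - 1)) for
    an h-th root of unity w_k. k = 0 gives z = 1, so only k = 1..h-1 are used.
    For rho < 1 the map contracts the unit disk (factor <= rho), so plain
    fixed-point iteration converges.
    """
    roots = []
    for k in range(1, h):
        w = cmath.exp(2j * math.pi * k / h)
        z = 0j
        for _ in range(iterations):
            z = w * cmath.exp(rho * (z - 1))
        roots.append(z)
    return roots


def mean_tss(rho, h):
    """E x for abstraction level h (assumed form of Eq. 2.6)."""
    pole_sum = sum(1 / (1 - z) for z in tss_roots(rho, h)).real
    return 1 / (2 * (1 - rho)) - h * (1 + rho) / 2 + pole_sum


def mean_md1(rho):
    """Mean number in an M/D/1 system, rho + rho**2 / (2*(1 - rho))."""
    return 1 / (2 * (1 - rho)) - (1 - rho) / 2


def relative_error(rho, h):
    """Relative abstraction error (E q - E x) / E q."""
    eq = mean_md1(rho)
    return (eq - mean_tss(rho, h)) / eq


if __name__ == "__main__":
    for rho in (0.60, 0.75, 0.90):
        print(rho, round(relative_error(rho, 5), 4))
```

For $h = 5$ and $\rho = 0.6$ the result should come out close to the corresponding entry of Table 2.1. The iteration count is fixed for simplicity; a residual-based stopping rule would be more economical for small $\rho$.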


2.2.2  Fluid  Queue 

Next  we  investigate  a  single  fluid  queue  with  piecewise  constant  inflow  rates.  The  finest  time 
scale of this flow is called a frame. We first show experimentally that correlation in the source
traffic  results  in  better  simulation  accuracy.  Then  we  provide  an  analytical  justification. 


Experiments 

Correlated fluid processes are generated by a two-step synthetic method: (a) a basic process
is cut into blocks containing $B$ frames each, and (b) the blocks are randomly shuffled
while the frame order within every block remains unchanged. The resulting processes have the
same first-order statistics as the basic one but different correlation. Fig. 2.3a shows the
autocovariance coefficient curves of 5 fluid sources. $Fluid_1$ is the basic one, which has the
strongest correlation within the longest range. $Fluid_5$, generated by completely shuffling
$Fluid_1$, is almost white noise. Each of these flows is used as an input to an infinite buffer
queue which is simulated under two abstraction levels, $h = 1$ and $h = 50$ ($h = 1$ is the
simulation with the finest resolution). The two sets of simulation results are compared
in Fig. 2.3b, which shows the relative error for each input process. Note that the higher
the inflow correlation, the smaller the approximation error, while for $Fluid_5$ (uncorrelated
traffic) the simulation accuracy is low even for a highly utilized system. This is consistent
with the observations in the M/D/1 system.

Figure 2.3: (a) Autocovariance coefficient curves of fluid processes. (b) Relative simulation
errors in mean system time of fluid processes ($h = 50$), plotted against utilization.
Legend: $Fluid_1$, $Fluid_2$ ($B = 200$), $Fluid_3$ ($B = 40$), $Fluid_4$ ($B = 20$), $Fluid_5$ ($B = 1$).
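The block-shuffling construction and the comparison between two abstraction levels can be sketched as follows. This is an illustrative outline under stated assumptions rather than the experimental code behind Fig. 2.3: frames have unit length, the error is measured on the mean backlog at frame boundaries (a proxy for the mean system time used in the figure), the base process in the demo is a simple clipped autoregressive sequence, and all function names are hypothetical.

```python
import random


def block_shuffle(frames, block_size, rng):
    """Cut frames into blocks of block_size and shuffle the blocks.

    The frame order inside each block is kept, so the marginal distribution
    is unchanged while correlation beyond the block length is destroyed.
    """
    blocks = [frames[i:i + block_size] for i in range(0, len(frames), block_size)]
    rng.shuffle(blocks)
    return [rate for block in blocks for rate in block]


def coarsen(frames, h):
    """Replace every group of h frames by its average rate (abstraction level h)."""
    out = []
    for i in range(0, len(frames), h):
        group = frames[i:i + h]
        out.extend([sum(group) / len(group)] * len(group))
    return out


def mean_backlog(frames, service_rate):
    """Mean end-of-frame backlog of an infinite-buffer fluid queue."""
    q = 0.0
    total = 0.0
    for rate in frames:
        q = max(0.0, q + rate - service_rate)
        total += q
    return total / len(frames)


def relative_error(frames, h, service_rate):
    """Relative error of the level-h simulation against the finest one (h = 1)."""
    fine = mean_backlog(frames, service_rate)
    coarse = mean_backlog(coarsen(frames, h), service_rate)
    return abs(fine - coarse) / fine  # undefined if the fine backlog is always zero


if __name__ == "__main__":
    rng = random.Random(0)
    base, r = [], 1.0
    for _ in range(100_000):            # assumed base process: clipped AR(1)
        r = max(0.0, 1.0 + 0.98 * (r - 1.0) + rng.gauss(0.0, 0.1))
        base.append(r)
    for block in (len(base), 200, 40, 20, 1):
        flow = block_shuffle(base, block, rng)
        print(block, round(relative_error(flow, 50, service_rate=1.25), 4))
```

Shuffling with a block size equal to the trace length leaves the base process untouched, while a block size of 1 is a full permutation of the frames.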


How  Correlation  Works 


TSS  uses  traffic  abstraction  and  feeds  approximate  inputs  into  queues.  Intuitively,  the 
more  an  approximate  input  looks  like  the  true  input,  the  better  the  simulation  accuracy. 
So  the  impact  of  correlation  on  simulation  performance  is  checked  based  on  how  correlation 
contributes to the similarity of two input flows. Let $x_i$, $i = 1, 2, \ldots$, be a stationary process
that denotes the fluid rate of the $i$th frame and define the following parameters:

1. $E e_k^2 = E[(x_k - \bar{x})^2]$, $k = 1, 2, \ldots, h$, is the mean square of the error of the $k$th frame
within each time step, where $\bar{x} = \frac{1}{h}\sum_{i=1}^{h} x_i$ is the average rate over the time step and
$h$ is the abstraction level. If $E e_k^2$ is small, both the mean
and variance of the error are small because $E e_k^2 = Var|e_k| + (E|e_k|)^2$. This variable
indicates how close the $k$th frames are in the original and approximate inflows.

2. $F = \frac{1}{h}\sum_{k=1}^{h} E e_k^2$ is the total abstraction error that counts all frame errors in each
time step. $F$ is a general denotation. $F_{un}$ and $F_{re}$ are particular denotations for
uncorrelated and correlated processes respectively.
It is easy to get

$$E e_k^2 = \frac{Var_x}{h^2}\sum_{i=1}^{h}\sum_{j=1}^{h}\left(1 + \rho_{|i-j|} - \rho_{|k-i|} - \rho_{|k-j|}\right), \tag{2.10}$$

where $\rho$ and $Var_x$ are the autocovariance coefficients and the variance respectively. Uncorrelated
processes have the following autocovariance coefficients

$$\rho_n = \begin{cases} 1, & n = 0 \\ 0, & n \neq 0, \end{cases} \tag{2.11}$$

and therefore, (2.10) reduces to

$$E e_k^2 = Var_x\left(1 - \frac{1}{h}\right).$$

Then

$$F_{un} = Var_x\left(1 - \frac{1}{h}\right).$$

Appendix A derives the abstraction error for a correlated process.

$$F_{re} = F_{un} - C, \tag{2.12}$$

where

$$C = \frac{2\,Var_x}{h^2}\sum_{i=1}^{h-1}(h - i)\,\rho_i. \tag{2.13}$$


Discussion: 

1.  Combining  Eqs.  (2.13)  and  (2.12)  we  see  that  the  abstraction  error  of  the  correlated 
process  is  smaller  than  that  of  the  corresponding  uncorrelated  process  if  the  two  inputs 
have  the  same  first  order  statistics.  Also,  the  larger  C  is,  the  smaller  the  abstraction 
error  and  so  the  better  the  simulation  accuracy. 

2. According to Eq. (2.13), autocovariance coefficients in the shorter range are given
larger weights. Better simulation performance is expected for fluid traffic with stronger
correlations in the short range. We conjecture that TSS with abstraction level $h$ simulates
a single fluid queue well if its input has strong correlation up to $h$ frames. Table 2.2
gives the errors of the 5 fluid processes mentioned in Sect. 2.2.2. It shows that the
abstraction error increases as correlation decreases.

3.  The  variance  of  the  rate  of  the  fluid  process  also  affects  the  simulation  performance 
as  seen  in  Eq.  (2.13),  where  C  is  proportional  to  this  variance. 
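The relation between the double-sum expression for the per-frame error and the compact form $F_{re} = F_{un} - C$ can be checked numerically. The sketch below assumes that $F$ is the average of $E e_k^2$ over the $h$ frames of a time step, that $E e_k^2 = \frac{Var_x}{h^2}\sum_i\sum_j (1 + \rho_{|i-j|} - \rho_{|k-i|} - \rho_{|k-j|})$, and that $C = \frac{2 Var_x}{h^2}\sum_{i=1}^{h-1}(h-i)\rho_i$; the function names and the geometric autocovariance sequence are illustrative choices, not part of the report.

```python
def abstraction_error(var_x, rho, h):
    """F_re = F_un - C, with rho[i] the lag-i autocovariance coefficient (rho[0] = 1)."""
    f_un = var_x * (1 - 1 / h)
    c = 2 * var_x / h**2 * sum((h - i) * rho[i] for i in range(1, h))
    return f_un - c


def abstraction_error_direct(var_x, rho, h):
    """Average over k of E e_k^2, each evaluated from the double sum."""
    total = 0.0
    for k in range(1, h + 1):
        double_sum = sum(
            1 + rho[abs(i - j)] - rho[abs(k - i)] - rho[abs(k - j)]
            for i in range(1, h + 1)
            for j in range(1, h + 1)
        )
        total += var_x / h**2 * double_sum
    return total / h


if __name__ == "__main__":
    h = 10
    for decay in (0.0, 0.5, 0.9, 0.99):
        rho = [decay**n for n in range(h)]   # assumed geometric correlation
        print(decay, round(abstraction_error(1.0, rho, h), 4))
```

With a geometric correlation sequence, the printed error shrinks as the decay factor approaches 1, which is the qualitative behaviour described in the discussion above.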

It  is  more  suitable  to  describe  the  traffic  of  computer  networks  by  point  processes.  The 
impact  of  autocorrelation  is  also  confirmed  by  experiments  where  more  general  queueing 
systems  with  point  processes  are  investigated.  Please  refer  to  [84]  for  details. 
Table 2.2: Statistical errors in input fluid processes brought by coarse simulation (h = 10)

Processes approximated   Fluid_1  Fluid_2  Fluid_3  Fluid_4  Fluid_5
E|e| (Experimental)      0.2988   0.2988   0.5858   0.7669   1.0733
F (Experimental)         0.4102   0.4102   0.8103   1.0661   1.5139
F (Analytical)           0.4105   0.4900   0.8207   1.0874   1.5137

2.3  Short-term  and  Long-term  Traffic  Characteristics 


The  previous  section  observes  the  impact  of  traffic  correlation  on  the  accuracy  of  time  step 
simulation.  This  section  uses  sample  path  analysis  to  justify  the  impact  of  correlation  and 
discusses  the  impact  of  burstiness  at  different  levels. 


2.3.1  Theoretical  Explanation 

Consider  a  single-server  queue  with  a  constant  service  rate,  where  the  buffer  is  assumed 
infinite unless otherwise specified. Cut the queueing sample path into blocks of length $h$. Each
queueing  block  contains  packet  arrival  information,  the  service  rate  and  the  initial  backlog. 
Fig.  2.1  shows  some  queueing  blocks.  Nonempty  queueing  blocks  are  classified  into  two 
categories,  (a)  Fully  busy  blocks  where  the  buffer  is  never  empty  during  the  block  and,  (b) 
partially  busy  blocks  where  the  buffer  is  empty  for  some  time.  A  busy  train  is  a  succession 
of  fully  busy  blocks.  Also,  define  the  following  parameters  for  a  busy  train  of  n  time  steps. 

• $\mu$: constant service rate

• $t_i = ih$, $i \in \{0, 1, \ldots, n\}$

• $Q_{iF}$: average queue length during $[t_{i-1}, t_i)$ in fine-time-scale simulation.

• $Q_{iT}$: average queue length during $[t_{i-1}, t_i)$ in time-step simulation.

• $a_i(t)$: rate function during $[t_{i-1}, t_i)$

• $\bar{a}_i$: average rate during $[t_{i-1}, t_i)$

• $Q_{bF}$: average queue length during the busy period $[0, nh]$ in fine-time-scale simulation.

• $Q_{bT}$: average queue length during the busy period $[0, nh]$ in time-step simulation.

First we assume that TSS has the same initial backlog as the fine-time-scale simulation, as
shown in Fig. 2.4a. A fully busy block is shown in Fig. 2.4b, where the solid line shows
the fluctuation of the buffered workload in the fine-scale simulation. Also, define

• $QA_F$: queueing area in fine-time-scale simulation

• $QA_T$: queueing area in time-step simulation

Figure 2.4: (a) A busy train, (b) A fully busy queueing block.

Figure 2.5: A busy train with initial discrepancy between TSS and PLS.

It is easy to get

    $QA_F = QA_T + \int_0^h t\,\bar{a}\,dt - \int_0^h t\,a(t)\,dt = QA_T - \int_0^h t\,b(t)\,dt$

where $b(t) = a(t) - \bar{a}$ is the rate difference function. Therefore,

    $Q_F = Q_T - \frac{1}{h}\int_0^h t\,b(t)\,dt$    (2.14)

where

    $\int_0^h b(t)\,dt = 0, \quad b(t) \ge -\bar{a}$.

Then for a busy train,

    $Q_{bF} = Q_{bT} - \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h}\int_0^h t\,b_i(t)\,dt$    (2.15)

where $b_i(t)$ is the rate difference function during $[t_{i-1}, t_i)$.
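The single-block relation (2.14) can be checked numerically. The following is a minimal sketch, assuming a fluid queue whose arrival rate is piecewise constant on fine slots of length dt and a block that stays fully busy; the function name and its arguments are illustrative and do not come from the report.

```python
def block_averages(rates, dt, mu, q0):
    """Average queue length over one fully busy block of length h = len(rates) * dt.

    rates: fine-scale arrival rates a(t), each held constant for dt seconds.
    Returns (Q_F, Q_T, correction) with correction = (1/h) * integral of t * b(t).
    """
    h = len(rates) * dt
    a_bar = sum(rates) / len(rates)        # block-average rate seen by TSS
    q, t = q0, 0.0
    area_fine, moment = 0.0, 0.0
    for a in rates:
        q_next = q + (a - mu) * dt         # backlog is linear inside each slot
        if q <= 0.0 or q_next <= 0.0:
            raise ValueError("block is not fully busy")
        area_fine += 0.5 * (q + q_next) * dt
        # exact integral of t * b(t) over [t, t + dt], with b = a - a_bar
        moment += (a - a_bar) * ((t + dt) ** 2 - t ** 2) / 2.0
        q, t = q_next, t + dt
    q_end = q0 + (a_bar - mu) * h          # TSS uses one constant rate per block
    area_tss = 0.5 * (q0 + q_end) * h
    return area_fine / h, area_tss / h, moment / h


# Front-loaded arrivals: the fine-scale average exceeds the TSS average.
q_f, q_t, corr = block_averages([3.0, 1.0], dt=1.0, mu=2.0, q0=5.0)
print(q_f, q_t, q_t - corr)
```

In this example the correction term is negative because the arrivals are concentrated at the start of the block, so the fine-scale average queue is larger than the time-step average.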
If TSS does not have the same initial backlog as the fine-time-scale simulation (as in
Fig. 2.5), we have

    $Q_F = Q_T - \frac{1}{h}\int_0^h t\,b_i(t)\,dt + \Delta$    (2.16)

where $\Delta < \mu h$ is determined by the backlog difference at the time step before the busy
train.


Subsequently, we use queueing blocks as a comparison unit and define the simulation
error of queueing block i as

    $e_i = \frac{|Q_{iF} - Q_{iT}|}{Q_{iF}}$    (2.17)


Therefore, the simulation error for the whole trace with N nonempty blocks is

    $e = \frac{1}{N}\sum_{i=1}^{N} e_i$    (2.18)


Note that $Q_{iT}$ of partially busy blocks is much smaller than $Q_{iF}$, thus $e_i \approx 1$. Let $\alpha$ denote
the ratio of the fully busy blocks among the total nonempty blocks. Then the simulation
error of the whole trace is expressed as

    $e \approx (1 - \alpha) \times 1 + \alpha\,\frac{\left| -\frac{1}{h}\int_0^h t\,b(t)\,dt + \Delta \right|}{Q_F}$    (2.19)

Discussion:


1. The error in partially busy blocks is usually larger than that of fully busy blocks.
Long busy periods increase the percentage of fully busy blocks $\alpha$ and help to reduce
the total simulation error. A heavily loaded system has a large percentage of fully
busy blocks, so TSS generally works well for this kind of system. For mildly or
lightly loaded systems, the performance of TSS depends on the traffic's characteristics:
under the same system utilization, long busy periods show up only in some cases.

2. As Fig. 2.4 and Fig. 2.5 show, the large backlog during busy periods lessens the
impacts of local rate variation and initial backlog discrepancy. In a busy train of n
blocks where the initial backlog is assumed zero, $Q_{bT}$ is decided by $\{\bar{a}_i, i = 1, \ldots, n\}$:

    $Q_{bT} = \sum_{i=1}^{n} \left[(n - i)\,\bar{a}_i\right]$    (2.20)

    $\sum_{i=1}^{n} \bar{a}_i = n\mu$    (2.21)

    $\bar{a}_i \ge 0$    (2.22)

In (2.20), the weights given to $\bar{a}_i$ decrease as i increases. Combining constraints
(2.21) and (2.22), it follows that higher arrival rates at the beginning of the
busy period and lower rates at the end lead to a larger $Q_{bT}$. Also, the larger the
rate difference between the two ends, the larger $Q_{bT}$ is. At time scale h, if traffic is
characterized by the concentration of periods with much higher arrival rates and the
concentration of periods with very low arrival rates, large queue build-ups arise easily,
which results in reasonable simulation accuracy.

Figure 2.6: TSS poorly captures queueing dynamics.
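The effect of ordering the block rates within a busy train can be illustrated with a small calculation. This is a sketch under stated assumptions: a fluid queue with zero initial backlog, one constant rate per block, and a backlog that returns to zero at the end of the train, as constraint (2.21) implies. Under these assumptions the time average equals (h/n) times the sum of (n - i)(ā_i - μ), which carries the same decreasing weights (n - i) as (2.20). The function name is hypothetical.

```python
def busy_train_avg_queue(rates, mu, h):
    """Time-average TSS backlog over a busy train with zero initial backlog.

    rates: block-average arrival rates, one per time step of length h.
    """
    q, area = 0.0, 0.0
    for a in rates:
        q_next = q + (a - mu) * h          # backlog changes linearly within a block
        if q_next < -1e-12:
            raise ValueError("backlog would go negative: not a busy train")
        area += 0.5 * (q + q_next) * h
        q = q_next
    return area / (len(rates) * h)


# Both trains satisfy sum(rates) == n * mu, but the second is more front-loaded.
mild = busy_train_avg_queue([3.0, 2.0, 1.0], mu=2.0, h=1.0)
steep = busy_train_avg_queue([4.0, 2.0, 0.0], mu=2.0, h=1.0)
print(mild, steep)
```

The second train has the larger rate difference between its two ends and therefore the larger average backlog.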


So far we have shown that TSS abstraction errors are due to queueing nonlinearity and
rate variation during each time step. Thus, for TSS to evaluate a queueing process
reasonably, the queueing process must show (a) long busy periods, which reduce queueing
nonlinearity, and (b) large queue lengths during busy periods, which reduce the impact of
local rate variation.

A heavily loaded system generally satisfies the above two requirements. For a mildly
loaded system, acceptable accuracy requires that the traffic show long bursts of strong
intensity. Otherwise, TSS cannot reflect the real queueing process, as shown in Fig. 2.6.
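The block error measure discussed above can be reproduced on a toy trace. The sketch below assumes a discrete-slot fluid queue, approximates a block's average queue length by the mean of its end-of-slot backlogs, models TSS by replacing each block's arrivals with their block mean, and takes the whole-trace error as the mean of the block errors over nonempty blocks. All names are illustrative.

```python
def queue_path(arrivals, mu, dt):
    """Backlog at the end of each fine slot, for per-slot arrival volumes."""
    q, path = 0.0, []
    for a in arrivals:
        q = max(0.0, q + a - mu * dt)
        path.append(q)
    return path


def tss_error(arrivals, mu, dt, block):
    """Mean relative error of block-average queue lengths (fine vs. time-stepped)."""
    n_blocks = len(arrivals) // block
    arrivals = arrivals[: n_blocks * block]
    smoothed = []
    for k in range(n_blocks):
        chunk = arrivals[k * block:(k + 1) * block]
        smoothed += [sum(chunk) / block] * block    # TSS keeps only the block mean
    fine = queue_path(arrivals, mu, dt)
    coarse = queue_path(smoothed, mu, dt)
    errors = []
    for k in range(n_blocks):
        q_f = sum(fine[k * block:(k + 1) * block]) / block
        q_t = sum(coarse[k * block:(k + 1) * block]) / block
        if q_f > 0.0:                               # nonempty blocks only
            errors.append(abs(q_f - q_t) / q_f)
    return sum(errors) / len(errors) if errors else 0.0


trace = [5.0, 0.0, 0.0, 0.0, 0.0] * 20              # one burst per 5-slot block
light = tss_error(trace, mu=2.0, dt=1.0, block=5)   # every block is partially busy
heavy = tss_error(trace, mu=0.9, dt=1.0, block=5)   # overloaded: one long busy train
print(light, heavy)
```

With the lightly loaded server every block is partially busy and TSS sees no queue at all, so each block error equals 1. With the overloaded server the backlog keeps growing, and the relative error shrinks as the busy train gets longer.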


2.3.2  Experiments 

By examining sample paths of the queueing process, we identified the factors that influence
simulation accuracy and gave an analytical explanation. Here we show some experimental
results to confirm our analysis. In the following discussion, short-term refers to the dynamics
during a time step (how arrivals are spaced). Long-term refers to a time scale equal to or
larger than the time step. In this study, we are mainly interested in the first- and second-order
statistics of long-term traffic variability.


Traffic  Model 

We use a Hierarchical On-Off Modulated Poisson Process (HOMPP), which is based on the
Hierarchical On-Off Process (HOP) [61]. An n-level HOP Y(t) is defined as:

    $Y(t) = \prod_{i=1}^{n} X_i(t)$    (2.23)

where each $X_i(t)$ is an independent On-Off process. Y(t) is in the “on” state only if all
component processes are “on”. HOMPP is generated from the HOP Y(t) by modulating a
Poisson process. In this study, the timescales of different layers are disparate and Layer n
works on the largest timescale. Any layer’s “on” periods are concentrated within the “on”
periods of its direct parent layer. This hierarchical model introduces burstiness at different
timescales. We also define the “peel” operation at Layer i, in which “on” periods
of Layer i − 1, rather than occurring during the “on” periods of Layer i, are uniformly
distributed over the time axis (with no overlap), as shown in Fig. 2.7.

Figure 2.7: Peeling operation on HOMPP models: a 2-layer HOMPP (a) before peeling and
(b) after peeling Layer 2.

A  HOMPP  can  model  several  levels  of  burstiness  while  peeling  can  adjust  the  degree 
of  burstiness.  HOMPP  is  used  to  investigate  the  impact  of  traffic  dynamics  at  different 
timescales  on  TSS  accuracy. 
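A HOMPP trace of the kind defined above can be generated along the following lines. This is a minimal sketch, not the generator used for the experiments: it assumes independent layers with exponentially distributed on and off periods, a stationary choice of each layer's initial state, and thinning of a constant-rate Poisson process by the product Y(t). Function names and parameters are hypothetical.

```python
import bisect
import random


def on_off_intervals(mean_on, mean_off, horizon, rng):
    """Sorted (start, end) 'on' intervals of one on-off layer over [0, horizon)."""
    t, intervals = 0.0, []
    on = rng.random() < mean_on / (mean_on + mean_off)   # assumed stationary start
    while t < horizon:
        length = rng.expovariate(1.0 / (mean_on if on else mean_off))
        if on:
            intervals.append((t, min(t + length, horizon)))
        t += length
        on = not on
    return intervals


def make_indicator(intervals):
    """Return X(t): 1 if t falls inside an 'on' interval, else 0."""
    starts = [s for s, _ in intervals]

    def x(t):
        k = bisect.bisect_right(starts, t) - 1
        return 1 if k >= 0 and t < intervals[k][1] else 0

    return x


def hompp_arrivals(rate, layers, horizon, seed=0):
    """Arrival times of a Poisson process of the given rate, kept only while Y(t) = 1.

    layers: list of (mean_on, mean_off) pairs, one per on-off layer.
    """
    rng = random.Random(seed)
    indicators = [make_indicator(on_off_intervals(on, off, horizon, rng))
                  for on, off in layers]
    arrivals, t = [], 0.0
    while True:
        t += rng.expovariate(rate)
        if t >= horizon:
            return arrivals
        if all(x(t) for x in indicators):    # Y(t) is the product of all X_i(t)
            arrivals.append(t)


trace = hompp_arrivals(100.0, [(1.0, 4.0), (25.0, 100.0)], horizon=500.0, seed=1)
print(len(trace))
```

Passing an empty layer list returns the unmodulated Poisson process, which is a convenient baseline when comparing burstiness.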


Long  Term  Dynamics 

We compare the simulation performance of hierarchical traffic. Table 2.3 lists the characteristics
of three traffic sources. H1 has three on-off layers. On and off periods are independent and
exponentially distributed. On average, “on” periods in Layer i contain five “on” periods in
Layer i − 1. H1P3 is generated by peeling H1’s Layer 3 and H1P2 by peeling Layers 3
and 2. The three traffic sources exhibit different burstiness properties. Let the time step
h be equal to the average “on” period length of Layer 2, i.e., 25 seconds. Next, we investigate
TSS performance and source characteristics at this time scale. Fluid trunks with a high arrival
workload are less concentrated in H1P3 than in H1 because H1P3 lacks Layer 3. As
expected, at this time scale, H1P3 has the same marginal distribution and less correlation,
as shown in Fig. 2.8a and Fig. 2.8b. In Fig. 2.9, H1P3 has worse simulation performance
than H1 because the queueing due to workload accumulation at the larger time scale is less
than that of H1.

Next we compare the traffic characteristics of H1 and H1P2. In Fig. 2.8b, H1P2 lacks
variability in its marginal distribution while arrivals in H1 are more widely spread. H1P2
only has one on-off layer and the “on” periods are on average 1 second long, uniformly
spaced on the time axis. So when observing this trace at a time scale of 25 seconds, the
trunk workload fluctuates around the average value in a small range. The workload amount
in consecutive trunks is close and there is still correlation among trunks, as shown in Fig. 2.8a.
According to the first-order statistics, H1P2 lacks long-term dynamics compared to
H1. Fig. 2.9a shows the simulation error where, as expected, H1 outperforms H1P3 due to
its rich dynamics at the long-term scale. We emphasize that correlation cannot be used as the
only indicator of long-term dynamics; the first-order statistics should also be considered.
Take an extreme case as an example: at a certain time scale, traffic has a constant average
arrival rate and rate variations only within time steps. TSS cannot simulate this queue at this
time scale. Even though this trace has strong correlation, it lacks long-term dynamics.

Table 2.3: Parameters of three HOMPP sources.

Traffic   Layer 0: Poisson    Layer 1: mean         Layer 2: mean         Layer 3: mean
          rate (pkts/sec)     on/off length (sec)   on/off length (sec)   on/off length (sec)
H1        100                 1 / 4                 25 / 100              625 / 2500
H1P3      100                 1 / 4                 25 / 475              -
H1P2      100                 1 / 88                -                     -

Figure 2.8: (a) HOMPP traffic autocovariance coefficient curves, (b) Marginal distribution
of H1 and derived traffic (time step 25s).

Figure 2.9: (a) Simulation errors of H1 and derived traffic (time step 25s). (b) Simulation
errors of H2 and derived traffic (time step 5s).


Short-Term  Burstiness 

If variability at large time scales exists in the input traffic, long-term arrivals are the major
factor in queue build-up. This case favors TSS. However, local burstiness degrades TSS
simulation performance. H2 is a 2-layer HOMPP with exponentially distributed on-off
periods. On Layer 1, the average “on” and “off” periods are 1s and 4s, respectively. On
Layer 2, they are 25s and 125s, respectively. Set the time step equal to 5s. It is expected
that arrivals most likely cluster within 1/5 of the interval. Smooth H2 to generate a new
source H2S. That is, if there are n arrivals in a time step, place these arrivals according to a
uniform distribution over the interval. Therefore, H2S is less bursty than H2 and shows
improved accuracy, as shown in Fig. 2.9b.

The next experiment demonstrates what happens if we make H2 more locally bursty by
changing its local arrival pattern. If there are n arrivals in a time step, the arrivals are placed
uniformly in part of the time step. The compression ratio determines how locally bursty the
traffic is. H2B is the source derived from H2 with compression ratio 0.01. Its simulation
accuracy does not degrade compared to that of H2, as shown in Fig. 2.9b.
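Smoothing and compression can both be expressed as one re-placement operation on a list of arrival times. The sketch below is an assumption-laden illustration: it draws the new positions independently and uniformly, and it places the compressed window at the start of each time step, which the text does not specify. The function name is hypothetical.

```python
import random


def redistribute(arrivals, h, ratio, seed=0):
    """Re-place the arrivals of each time step uniformly in its first ratio * h seconds.

    ratio = 1.0 smooths the trace (as for H2S); a small ratio such as 0.01
    compresses each step's arrivals into a short burst (as for H2B).
    """
    rng = random.Random(seed)
    counts = {}
    for t in arrivals:
        k = int(t // h)                      # index of the time step containing t
        counts[k] = counts.get(k, 0) + 1
    out = []
    for k, n in counts.items():
        out += [k * h + rng.random() * ratio * h for _ in range(n)]
    return sorted(out)


source = [0.1, 0.2, 4.9, 5.5, 12.0]
smooth = redistribute(source, h=5.0, ratio=1.0)
bursty = redistribute(source, h=5.0, ratio=0.01)
print(smooth, bursty)
```

Both derived traces keep the number of arrivals in every time step, so TSS sees exactly the same input for all three sources and only the fine-scale reference changes.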

To study the impact of local burstiness when queueing at large time scales does not
dominate, we use external shuffling to generate a new source which lacks strong long-term
dynamics. H2 is chopped into blocks of fixed length and then the blocks are randomly
permuted while keeping the relative position of arrivals within a block unchanged. We choose
a block length of 0.1s. The new source, H2EXT, is less bursty than H2 in the long term. Then we
compare H2EXT with its derived traffic with compression ratios 0.5 and 0.1. As shown in
Fig. 2.10, local burstiness has an impact.
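External shuffling as described above amounts to a random permutation of fixed-length blocks. The following sketch assumes the trace is a list of arrival times and that arrivals beyond the last full block are left in place; the function name and the treatment of the tail are illustrative choices.

```python
import random


def external_shuffle(arrivals, block_len, horizon, seed=0):
    """Randomly permute fixed-length blocks, keeping offsets inside each block."""
    rng = random.Random(seed)
    n_blocks = int(horizon // block_len)
    new_pos = list(range(n_blocks))
    rng.shuffle(new_pos)                     # new_pos[old] = position after shuffling
    out = []
    for t in arrivals:
        old = int(t // block_len)
        if old >= n_blocks:                  # tail past the last full block (assumed kept)
            out.append(t)
        else:
            out.append(new_pos[old] * block_len + (t - old * block_len))
    return sorted(out)


shuffled = external_shuffle([0.25, 0.75, 1.5, 3.125], block_len=1.0, horizon=4.0)
print(shuffled)
```

Arrivals that share a block before shuffling still share a block afterwards, so short-term structure is preserved while the ordering of blocks, and with it the long-term correlation, is randomized.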

The above experiments show that TSS feasibility strongly depends on the traffic
characteristics and the system utilization. For a given time scale, if the source traffic does not show
significant long-term dynamics (long bursts with enough intensity), TSS at the corresponding
time step works poorly unless the system is heavily loaded.


2.3.3  Compensation 

According to the previous discussion, local queueing dynamics should be taken into account
to improve simulation accuracy. Assume that local statistics are known, but the exact arrival
steps are unknown. We use TSS to track the queue evolution and, for every time step, add
$q_{loc}$ to $q_T$ to compensate for local queueing dynamics.

Figure 2.10: Simulation errors of H2EXT and its derived traffic (time step 5s).

In a time step, given the local traffic statistics, the utilization $\rho$, and zero initial backlog,
we can determine $q_{loc}$ from the average of $q(\xi)$, where $q(\xi)$ is the queue length averaged
over the interval when the arrival sample $\xi$ follows the local statistics. We use off-line
simulation to get the local queueing curve $q_{loc} \sim \rho$. TSS then uses this curve and adds
$q_{loc}(\rho)$ to $q_T$ to account for the local queueing effects.
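An off-line estimate of the local queueing curve can be sketched as a small Monte Carlo experiment. The code below assumes uniformly placed arrivals that each bring one unit of work, a server that drains work at rate mu, zero initial backlog, and workload as the queue measure; these modelling choices and all names are assumptions made for illustration.

```python
import random


def avg_workload(times, mu, h):
    """Time-average workload over [0, h] for unit-work arrivals, zero initial backlog."""
    area, w, prev = 0.0, 0.0, 0.0
    for t in sorted(times) + [h]:            # h acts as a closing sentinel
        drain = min(w, mu * (t - prev))
        # area under the linearly decreasing workload between two events
        area += (w * w - (w - drain) ** 2) / (2.0 * mu)
        w = w - drain + (1.0 if t < h else 0.0)
        prev = t
    return area / h


def local_queue_curve(mu, h, counts, samples=200, seed=0):
    """Estimate (rho, q_loc) pairs, one per arrival count, by averaging q(xi)."""
    rng = random.Random(seed)
    curve = []
    for n in counts:
        rho = n / (mu * h)                   # offered work over service capacity
        total = 0.0
        for _ in range(samples):
            xi = [rng.uniform(0.0, h) for _ in range(n)]   # one local arrival sample
            total += avg_workload(xi, mu, h)
        curve.append((rho, total / samples))
    return curve


for rho, q_loc in local_queue_curve(mu=1.0, h=5.0, counts=[1, 2, 3, 4]):
    print(round(rho, 2), round(q_loc, 3))
```

A compensated TSS run would look up, or interpolate, q_loc at each step's utilization and add it to the time-step queue length.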

In the following, we experimentally show the performance of the compensation scheme.
As mentioned before, H2S arrivals are uniformly distributed within five-second time steps.
Knowing that, we get the $q_{loc} \sim \rho$ curves. Fig. 2.11a shows the improved simulation accuracy,
especially in low-utilization situations. As mentioned before, H2 is a 2-layer hierarchical
source and the average “on” period in Layer 2 is 25 seconds. Within this time scale, the local
dynamics are governed by the Layer 1 on-off modulation. Fig. 2.11b shows the results of two
compensation schemes. One scheme assumes arrivals are uniformly distributed within the
interval, while the other takes the burstiness resulting from Layer 1 modulation into
consideration. The comparison shows that more accurate information about local statistics
improves the compensation performance.

When local statistical information is unknown, we use trace-driven methods to extract
local statistics, assuming that they are stable for the whole simulation. Feed the
trace into a queue. Assuming zero backlog at the beginning of every time step, record the
average queue length and the corresponding utilization for every time step. In the
$q_{loc} \sim \rho$ plane, every pair of records is a point, and we use a curve-fitting scheme to get
the local queueing curve. Fig. 2.12a shows such a curve for the source H2EXT. Fig. 2.12b
shows the compensation results.

Combining local traffic statistics into TSS makes it work under a broad set of conditions.
Our methods for obtaining local queueing curves are quite rough and preliminary, but they
emphasize that properly transforming local statistical information into the macroscopic-level
simulation improves accuracy. This scheme provides a basis for improvement. In reality,
local traffic statistics are not expected to change frequently and rapidly; thus they can
be measured on-line. Moreover, under certain scenarios, there are clues about the local
statistical properties of traffic. For example, Cao et al. [9] observe that packet inter-arrival
times tend to become independent as the number of active connections increases, due to
statistical multiplexing. For this case, the compensation with a uniform distribution is
expected to work.

Figure 2.11: (a) H2S compensation results (time step 5s). (b) H2 compensation results
(time step 5s).

Figure 2.12: (a) Local queueing curve of H2EXT (time step 5s). (b) Compensation results
of H2EXT (time step 5s).

Discussion 

Nicol et al. [64] observe small simulation errors when comparing fluid and packet-level
simulation. Their traffic model is a Markov Modulated Process (MMP) and packets
are transmitted at a specified constant rate when the underlying Markov chain is in a given
state. Their study therefore does not consider the simulation errors from local traffic dynamics.
Instead, they investigate the errors resulting from the lack of workload discretization and
from flow interference.

Yan  [85]  derived  lower  and  upper  error  bounds  of  TSS.  For  a  single-flow  single-server 
queue,  the  distance  between  two  bounds  is  the  time  step  multiplied  by  service  rate.  The 
bounds  are  tight  only  when  queue  building  is  mainly  due  to  inter-trunk  workload  interaction. 
Otherwise,  the  bound  is  too  loose  to  be  useful  because  the  bound  distance  is  of  the  order 
of  the  actual  queue  length.  This  study  helps  in  determining  when  bounds  are  tight. 

This study raises several questions. When adjusting the system’s granularity from fine
to coarse, how should a component’s micro behavior be abstracted into a macroscopic
one? In the single-queue case, is it always proper to assume smooth/deterministic
microscopic behavior of traffic? As our compensation experiments show, combining microscopic
statistics into the abstract simulation expands the working range of the simulation.

This research also raises the issue of resolution in traffic modelling. We expect that for
the performance evaluation of queueing systems, unless rare events are evaluated, it does not
help much to model the rich local statistics at the cost of expensive modelling if traffic shows
strong dynamics at larger timescales. This is because the impact of local traffic dynamics
on the queue is weakened by strong long-term dynamics in the source traffic. However, fine-level
traffic models such as multi-scaling [28] could be helpful in other evaluation scenarios.


2.4  Summary 


Time  stepped  simulation  can  vary  the  abstraction  levels  and  point  out  critical  parts  in  a 
network  design  at  a  low  modelling  and  simulation  cost.  Current  results  are  encouraging. 
However,  in  practice,  under  what  kind  of  scenarios  can  TSS  work?  We  focus  on  the  accuracy 
analysis  of  a  single-flow  single-server  queue.  We  identify  that  the  system  utilization  and 
traffic  characteristics  at  short-term  and  long-term  timescales  affect  the  simulation  accuracy. 
Queueing  nonlinearity  and  local  rate  variation  are  two  basic  error  sources  for  the  considered 
scenarios.  Therefore,  we  propose  compensated  TSS  to  combine  local  statistical  information 
into  TSS.  Our  results  are  encouraging. 

Naturally, this study will be expanded to networks of queues in order to study the effects of
multi-flow interference and network topology. We plan to study the impact of flow and
spatial granularity in addition to time granularity. In addition, Poisson-driven differential
equations are a powerful tool for solving some queueing problems. We are using this tool to
solve queueing systems that use Markov-hierarchy-onoff fluid input flows. This will provide
insights on multi-resolution modelling errors.
Chapter 3


CONCURRENT  SIMULATION 


It is by now well-documented in the literature that the nature of sample paths of DES can
be exploited so as to extract a significant amount of information, beyond merely an estimate
of $J(\theta)$. It has been shown that observing a sample path under some parameter value $\theta$
allows us to efficiently obtain estimates of derivatives of the form $dJ/d\theta$ which are in many
cases unbiased and strongly consistent (e.g., see [20, 31, 41] where Infinitesimal Perturbation
Analysis (IPA) and its extensions are described). Similarly, Finite Perturbation Analysis
(FPA) has been used to estimate finite differences of the form $\Delta J(\Delta\theta)$ or to approximate
the derivative $dJ/d\theta$ through $\Delta J/\Delta\theta$ when other PA techniques fail [21].

All of the methods developed to date, regardless of specific details, have been motivated
by the same objective: from a single sample path under $\theta$, extract information to estimate
the derivative $dJ/d\theta$ or the response of the system, $J(\theta')$, under other parameter values
$\theta' \neq \theta$ (see Fig. 3.1). This information can be extremely useful in sensitivity analysis
and optimization of DES as well as data collection for metamodel building. Both of these
applications will be demonstrated later in this report. Next we demonstrate the IPA
approach using an example from the area of communication networks (for more details see
also [17, 16]).


3.1  Introduction 


A  natural  modeling  framework  for  packet-based  communication  networks  is  provided  through 
queueing  systems.  However,  the  huge  traffic  volume  that  networks  are  supporting  today 
makes  such  models  highly  impractical.  It  may  be  impossible,  for  example,  to  simulate  at 
the  packet  level  a  network  slated  to  transport  packets  at  gigabit-per-second  rates.  If,  on  the 
other  hand,  we  are  to  resort  to  analytical  techniques  from  classical  queueing  theory,  we  find 
that  traditional  traffic  models,  largely  based  on  Poisson  processes,  need  to  be  replaced  by 
more  sophisticated  stochastic  processes  that  capture  the  bursty  nature  of  realistic  traffic; 
in  addition,  we  need  to  explicitly  model  buffer  overflow  phenomena  which  typically  defy 
tractable  analytical  derivations. 
Figure 3.1: Concurrent Simulation Principle.


An  alternative  modeling  paradigm,  based  on  Stochastic  Fluid  Models  (SFM),  has  been 
recently  considered  for  the  purpose  of  analysis  and  simulation  [4,  48,  77,  47,  49,  62,  56, 
86,  79].  The  fluid-flow  worldview  can  provide  either  approximations  to  complex  discrete- 
event  models  or  primary  models  in  their  own  right.  In  any  event,  its  justification  rests 
on  a  molecular  view  of  packets  in  moderate-to-heavy  loads  over  high-speed  transmission 
links,  where  the  effect  of  an  individual  packet  or  cell  on  the  entire  traffic  process  is  virtually 
infinitesimal,  not  unlike  the  effect  of  a  water  molecule  on  the  water  flow  in  a  river. 

Our objective in this chapter is no different from other perturbation analysis techniques:
from a single sample path under $\theta$, extract additional information to estimate the derivative
$dJ/d\theta$. In the discrete-event framework such derivative estimates are often biased. To avoid
this problem, we adopt a stochastic fluid model and derive remarkably simple sensitivity
estimators. These estimators turn out to be nonparametric in the sense that they are
computable from data directly observable along a sample path, requiring no knowledge of
the underlying probability law, including distributions of the random processes involved, or
even parameters such as traffic or processing rates. In addition, the estimators obtained are
unbiased under very weak structural assumptions on the defining traffic processes. Finally,
because these estimators are nonparametric, we can evaluate them based on data observed
from the sample path of the discrete-event system; thus we do not necessarily need to
construct the equivalent stochastic fluid model. In effect, we use the SFM only for the
analysis part, where we derive the structure of the IPA derivative estimators. However,
when we actually evaluate them we simply observe the sample path of the discrete-event
system, either the true system or a discrete-event simulator.

The IPA gradient estimators that we derive can be readily used for on-line control
purposes. For example, in the context of communication networks they can be used to perform
periodic network management functions in order to guarantee negotiated QoS parameters
and to improve performance. One such example is presented in Chapter 6 (see also [17, 16]
for more details). Aside from solving explicit optimization problems, IPA gradient
estimators can be used for expediting the data collection process in metamodel building as
described in Chapter 4.


3.2  The  Stochastic  Fluid  Model  (SFM)  Setting 


The SFM setting is based on the fluid-flow worldview, where “liquid molecules” flow in
a continuous fashion. The basic SFM, used in [80] and shown in Fig. 3.2, consists of a
single server (spigot) preceded by a buffer (fluid storage tank), and it is characterized by
five stochastic processes, all defined on a common probability space $(\Omega, \mathcal{F}, P)$ as follows:

• $\{\alpha(t)\}$: the input flow (inflow) rate to the SFM,

• $\{\beta(t)\}$: the service rate, i.e., the maximal fluid discharge rate from the server,

• $\{\delta(t)\}$: the output flow (outflow) rate from the SFM, i.e., the actual fluid discharge
rate from the server,

• $\{x(t)\}$: the buffer occupancy or buffer content, i.e., the volume of fluid in the buffer,

• $\{\gamma(t)\}$: the overflow (spillover) rate due to excessive incoming fluid at a full buffer.


The above processes evolve over a time interval $[0, T]$ for a given fixed $T > 0$. The
inflow process $\{\alpha(t)\}$ and the service-rate process $\{\beta(t)\}$ are assumed to be right-continuous
piecewise constant, with $0 \le \alpha_{\min} \le \alpha(t) \le \alpha_{\max} < \infty$ and $0 \le \beta_{\min} \le \beta(t) \le \beta_{\max} < \infty$.
Let $\theta$ denote the size of the buffer, which is the variable parameter we will concentrate
on for the purpose of IPA. The processes $\{\alpha(t)\}$ and $\{\beta(t)\}$, along with the buffer size $\theta$,
define the behavior of the SFM. In particular, they determine the buffer content $x(\theta; t)$, the
overflow rate $\gamma(\theta; t)$, and the output flow $\delta(\theta; t)$. The notational dependence on $\theta$ indicates
that we will analyze performance metrics as functions of the given $\theta$. We will assume that
the real-valued parameter $\theta$ is confined to a closed and bounded (compact) interval $\Theta$; to
avoid unnecessary technical complications, we assume that $\theta > 0$ for all $\theta \in \Theta$.


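The buffer content $x(\theta; t)$ is determined by the following one-sided differential equation,

    $\frac{dx(\theta; t)}{dt^+} = \begin{cases} 0, & \text{if } x(\theta; t) = 0 \text{ and } \alpha(t) - \beta(t) \le 0, \\ 0, & \text{if } x(\theta; t) = \theta \text{ and } \alpha(t) - \beta(t) \ge 0, \\ \alpha(t) - \beta(t), & \text{otherwise} \end{cases}$    (3.1)

with the initial condition $x(\theta; 0) = x_0$ for some given $x_0$; for simplicity, we set $x_0 = 0$
throughout the chapter. The outflow rate $\delta(\theta; t)$ is given by

    $\delta(\theta; t) = \begin{cases} \beta(t), & \text{if } x(\theta; t) > 0, \\ \alpha(t), & \text{if } x(\theta; t) = 0, \end{cases}$    (3.2)

where we point out that if we allow $\theta = 0$, then $\delta(\theta; t) = \min\{\alpha(t), \beta(t)\}$. The overflow
rate $\gamma(\theta; t)$ is given by

    $\gamma(\theta; t) = \begin{cases} \max\{\alpha(t) - \beta(t), 0\}, & \text{if } x(\theta; t) = \theta, \\ 0, & \text{if } x(\theta; t) < \theta. \end{cases}$    (3.3)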
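Equations (3.1)-(3.3) can be read as a pointwise rule that maps the current buffer content and the two defining rates to the derived rates. The following is a direct transcription under the assumption that the buffer content and rates are given as plain numbers at one instant; the function name is hypothetical.

```python
def sfm_rates(x, alpha, beta, theta):
    """Evaluate (3.1)-(3.3) at one instant.

    x: buffer content, alpha: inflow rate, beta: service rate, theta: buffer size.
    Returns (dx/dt+, outflow rate delta, overflow rate gamma).
    """
    net = alpha - beta
    if x <= 0.0 and net <= 0.0:
        dx = 0.0                             # empty buffer stays empty
    elif x >= theta and net >= 0.0:
        dx = 0.0                             # full buffer stays full
    else:
        dx = net
    delta = beta if x > 0.0 else alpha       # (3.2): server runs at beta while backlogged
    gamma = max(net, 0.0) if x >= theta else 0.0   # (3.3): spill only at a full buffer
    return dx, delta, gamma


print(sfm_rates(0.0, 1.0, 2.0, 5.0))   # empty period
print(sfm_rates(2.0, 3.0, 2.0, 5.0))   # buffer filling
print(sfm_rates(5.0, 3.0, 2.0, 5.0))   # full buffer, overflow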


This SFM can be viewed as a dynamic system whose input consists of the two defining
processes $\{\alpha(t)\}$ and $\{\beta(t)\}$ along with the buffer size $\theta$, whose state comprises the buffer
content process, and whose output includes the outflow and overflow processes. The state and
output processes are referred to as derived processes, since they are determined by the
defining processes. Since the input sample functions (realizations) of $\{\alpha(t)\}$ and $\{\beta(t)\}$ are
piecewise constant and right-continuous, the state trajectory $x(\theta; t)$ is piecewise linear and
continuous in t, and the output function $\gamma(\theta; t)$ is piecewise constant. Moreover, the state
trajectory can be decomposed into two kinds of intervals: empty periods and busy periods.
Empty Periods (EP) are maximal intervals during which the buffer is empty, while Busy
Periods (BP) are supremal intervals during which the buffer is nonempty. Observe that
during an EP the system is not necessarily idle since the server may be active; see (3.2).
Note also that since $x(\theta; t)$ is continuous in t, EPs are always closed intervals, whereas BPs
are open intervals unless containing one of the end points 0 or T. The outflow process $\{\delta(t)\}$
becomes important in modeling networks of SFMs, but it will not concern us any further
here, since our interest in this chapter lies in single-node systems.


Let $\mathcal{L}(\theta): \Theta \to \mathbb{R}$ be a random function defined over the underlying probability space
$(\Omega, \mathcal{F}, P)$. Strictly speaking, we write $\mathcal{L}(\theta, \omega)$ to indicate that this sample function depends
on the sample point $\omega \in \Omega$, but will suppress $\omega$ unless it is necessary to stress this fact.
In what follows, we will consider two performance metrics, the Loss Volume $L_T(\theta)$ and the
Cumulative Workload (or just Work) $Q_T(\theta)$, both defined on the interval $[0, T]$ via the
following equations:

    $L_T(\theta) = \int_0^T \gamma(\theta; t)\,dt$,    (3.4)

    $Q_T(\theta) = \int_0^T x(\theta; t)\,dt$,    (3.5)

where, as already mentioned, we assume that $x(\theta; 0) = 0$. Observe that $\frac{1}{T}E[L_T(\theta)]$ is the
Expected Loss Rate over the interval $[0, T]$, a common performance metric of interest (from
which related metrics such as Loss Probability can also be derived). Similarly, $\frac{1}{T}E[Q_T(\theta)]$
is the Expected Buffer Content over $[0, T]$. In this chapter we are interested in estimates of
$dJ_L(\theta)/d\theta$ and $dJ_Q(\theta)/d\theta$ provided by the sample derivatives $dL_T(\theta)/d\theta$ and $dQ_T(\theta)/d\theta$.
Accordingly, the objective of the next section is the estimation of the derivatives of $J_L(\theta)$ and
$J_Q(\theta)$, which we will pursue through Infinitesimal Perturbation Analysis (IPA) techniques
[41, 20]. Henceforth we shall use the “prime” notation to denote derivatives with respect to
$\theta$, and will proceed to estimate the derivatives $J'_L(\theta)$ and $J'_Q(\theta)$. The corresponding sample
derivatives are denoted by $L'_T(\theta)$ and $Q'_T(\theta)$, respectively.
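For piecewise-constant inputs, the two sample performance functions (3.4) and (3.5) can be integrated exactly, segment by segment. The sketch below assumes the sample path is supplied as a list of (duration, inflow rate, service rate) segments and that the buffer starts empty; the function name and the input format are illustrative.

```python
def sfm_metrics(segments, theta):
    """Loss volume L_T and cumulative workload Q_T of an SFM sample path, x(0) = 0.

    segments: list of (duration, alpha, beta) with constant rates on each segment.
    theta: buffer size.
    """
    x, loss, work = 0.0, 0.0, 0.0
    for d, alpha, beta in segments:
        net = alpha - beta
        if net > 0:
            t_move = min(d, (theta - x) / net)   # time until the buffer fills
        elif net < 0:
            t_move = min(d, x / -net)            # time until the buffer empties
        else:
            t_move = d
        work += x * t_move + 0.5 * net * t_move ** 2   # area under the linear piece
        x = min(theta, max(0.0, x + net * t_move))
        rest = d - t_move                        # time spent stuck at a boundary
        if rest > 0:
            if net > 0:                          # full buffer: fluid spills at rate net
                x = theta
                loss += net * rest
                work += theta * rest
            else:                                # empty buffer: no work, no loss
                x = 0.0
    return loss, work


path = [(2.0, 3.0, 1.0), (3.0, 0.0, 1.0)]
print(sfm_metrics(path, theta=3.0))
print(sfm_metrics(path, theta=3.1))
```

Evaluating the same sample path at two nearby buffer sizes, as in the last two lines, gives a finite-difference view of how both metrics respond to the buffer size.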


3.3  Infinitesimal  Perturbation  Analysis  (IPA)  with  respect 
to  Buffer  Size  or  Threshold 


We will concentrate on the buffer size $\theta$ in the SFM described above or, equivalently, a
threshold parameter used for buffer control. We assume that the processes $\{\alpha(t)\}$ and
$\{\beta(t)\}$ are independent of $\theta$ and of the buffer content. Thus, we consider network settings
operating with protocols such as ATM and UDP, but not TCP. Our objective is to
estimate the derivatives $J'_L(\theta)$ and $J'_Q(\theta)$ through the sample derivatives $L'_T(\theta)$ and $Q'_T(\theta)$,
which are commonly referred to as Infinitesimal Perturbation Analysis (IPA) estimators;
comprehensive discussions of IPA and its applications can be found in [41, 20]. The IPA
derivative-estimation technique computes $L'_T(\theta)$ and $Q'_T(\theta)$ along an observed sample path
$\omega$. An IPA-based estimate $\mathcal{L}'(\theta)$ of a performance metric derivative $dE[\mathcal{L}(\theta)]/d\theta$ is unbiased
if $dE[\mathcal{L}(\theta)]/d\theta = E[\mathcal{L}'(\theta)]$. Unbiasedness is the principal condition for making the
application of IPA practical, since it enables the use of the sample (IPA) derivative in control and
optimization methods that employ stochastic gradient-based techniques.

We consider sample paths of the SFM over $[0,T]$. For a fixed $\theta \in \Theta$, the interval $[0,T]$
is divided into alternating EPs and BPs. Suppose there are $K$ busy periods denoted by $B_k$,
$k = 1,\ldots,K$, in increasing order. Then, by (3.4)-(3.5), the sample performance functions
assume the following form:

$$L_T(\theta) = \sum_{k=1}^{K} \int_{B_k} \gamma(\theta;t)\,dt, \tag{3.6}$$

$$Q_T(\theta) = \sum_{k=1}^{K} \int_{B_k} x(\theta;t)\,dt. \tag{3.7}$$

As mentioned earlier, the processes $\{\alpha(t)\}$ and $\{\beta(t)\}$ are assumed piecewise constant. This
implies that, w.p.1, there exist a random integer $N(T) \ge 0$ and an increasing sequence of
time points $0 = t_0 < t_1 < \ldots < t_{N(T)} < t_{N(T)+1} = T$, generally dependent upon the sample
path $\omega$, such that $t_i$ is a jump (discontinuity) point of $\alpha(t) - \beta(t)$; clearly, $\alpha(t) - \beta(t)$ is
continuous at all points other than $t_0, \ldots, t_{N(T)}$. We will assume that $N(T)$ has a finite
expectation, i.e., $E[N(T)] < \infty$.

Viewed as a discrete-event system, an event in a sample path of the SFM may be either
exogenous or endogenous. An exogenous event is a jump in either $\{\alpha(t)\}$ or $\{\beta(t)\}$. An
endogenous event is defined to occur when the buffer becomes full or empty. We note
that the times at which the buffer ceases to be full or empty are locally independent of $\theta$,
because they correspond to a change of sign in the difference function $\alpha(t) - \beta(t)$. (By a
random function $f(\theta)$ being “locally independent” of $\theta$ we mean that for a given $\theta$ there
exists $\Delta\theta > 0$ such that for every $\bar{\theta} \in (\theta - \Delta\theta, \theta + \Delta\theta)$, w.p.1 $f(\bar{\theta}) = f(\theta)$, where $\Delta\theta$ may
depend on both $\theta$ and on the sample path.) Thus, given a BP $B_k$, its starting point is one
where the buffer ceases to be empty and is therefore locally independent of $\theta$, while its end
point generally depends on $\theta$. Denoting these points by $\xi_k$ and $\eta_k(\theta)$ we express $B_k$ as

$$B_k = (\xi_k, \eta_k(\theta)), \quad k = 1,\ldots,K$$

for some random integer $K$. The BPs can be classified according to whether some overflow
occurs during them or not. Thus, we define the random set

$$\Phi(\theta) := \{k \in \{1,\ldots,K\} : x(t) = \theta,\ \alpha(t) - \beta(t) > 0 \text{ for some } t \in (\xi_k, \eta_k(\theta))\}.$$

Figure 3.3: A typical sample path of a SFM

For every $k \in \Phi(\theta)$, there is a (random) number $M_k \ge 1$ of overflow periods in $B_k$, i.e.,
intervals during which the buffer is full and $\alpha(t) - \beta(t) > 0$. Let us denote these overflow periods
by $\mathcal{F}_{k,m}$, $m = 1,\ldots,M_k$, in increasing order and express them as $\mathcal{F}_{k,m} = [u_{k,m}(\theta), v_{k,m}]$,
$k = 1,\ldots,K$. Observe that the starting time $u_{k,m}(\theta)$ generally depends on $\theta$, whereas the
ending time $v_{k,m}$ is locally independent of $\theta$, since it corresponds to a change of sign in the
difference function $\alpha(t) - \beta(t)$, which has been assumed independent of $\theta$. Finally let

$$B(\theta) = |\Phi(\theta)| \tag{3.8}$$

where $|\cdot|$ denotes the cardinality of a set, i.e., $B(\theta)$ is the number of BPs in $[0,T]$ during
which some overflow is observed. To summarize:

• There are $K$ busy periods in $[0,T]$, with $B_k = (\xi_k, \eta_k(\theta))$, $k = 1,\ldots,K$.

• $k \in \Phi(\theta)$ iff some overflow occurs during $B_k$; we set $B(\theta) = |\Phi(\theta)|$.

• For each $k \in \Phi(\theta)$, there are $M_k$ overflow periods in $B_k$, i.e., $\mathcal{F}_{k,m} = [u_{k,m}(\theta), v_{k,m}]$,
$m = 1,\ldots,M_k$.

A typical sample path is shown in Fig. 3.3, where $K = 3$, $\Phi = \{1, 3\}$, $M_1 = 2$, $M_2 = 0$,
$M_3 = 1$.

Next  we  present  the  IPA  derivative  estimates.  Note  that  the  results  of  the  next  section 
can  be  derived  using  either  finite  differences  or  direct  sample  differentiation.  We  only  present 
the  main  results  without  any  proofs.  The  interested  reader  is  referred  to  Appendix  B  or 
[17]  for  details. 




3.3.1  Infinitesimal  Perturbation  Analysis 

In this subsection, we derive explicitly the sample derivatives $L'_T(\theta)$ and $Q'_T(\theta)$ of the loss
volume and work, defined in (3.6) and (3.7), respectively. We then show that they provide
unbiased estimators of the expected loss volume sensitivity $dE[L_T(\theta)]/d\theta$ and the expected
work sensitivity $dE[Q_T(\theta)]/d\theta$.

Since we are concerned with the sample derivatives $L'_T(\theta)$ and $Q'_T(\theta)$, we have to identify
conditions under which they exist. Observe that any endogenous event time (a time point
when the buffer becomes full or empty) is generally a function of $\theta$; see also (3.1). Denoting
this point by $t(\theta)$, the derivative $t'(\theta)$ exists as long as $t(\theta)$ is not a jump point of the
difference process $\{\alpha(t) - \beta(t)\}$. Recall that the times at which the buffer ceases to be full
or empty are locally independent of $\theta$, because they correspond to a change of sign of the
difference sample function $\alpha(t) - \beta(t)$, which does not depend on $\theta$. Excluding the possibility
of the simultaneous occurrence of two events, the only situation preventing the existence
of the sample derivatives $L'_T(\theta)$ and $Q'_T(\theta)$ involves an interval during which $x(t) = \theta$ and
$\alpha(t) - \beta(t) = 0$, as seen in (3.3); in this case, the one-sided derivatives of $L_T(\theta)$ and $Q_T(\theta)$
exist and can be obtained with the approach of the previous section. In order to keep the
analysis simple, we focus only on the differentiable case. Therefore, the analysis that follows
rests on the following technical conditions:

Assumption 1

a. W.p.1, $\alpha(t) - \beta(t) \neq 0$.

b. For every $\theta \in \Theta$, w.p.1, no two events may occur at the same time.

Remark 1 We stress the fact that the above conditions for ensuring the existence of the
sample derivatives $L'_T(\theta)$ and $Q'_T(\theta)$ are very mild. Part b above is satisfied whenever the
cdf's (or conditional cdf's) characterizing the intervals between exogenous event occurrences
are continuous. To illustrate a case where it fails, consider the simple setting where $\beta(t) = \beta$ and
$\alpha(t)$ can only take two values, 0 and $\alpha > \beta$, and suppose that the inflow process switches
from $\alpha$ to 0 after $\theta/(\alpha - \beta)$ time units w.p.1. The buffer then becomes full exactly when an
exogenous event occurs, and the loss volume sample function experiences a discontinuity w.p.1.
Such situations can only arise for a small finite subset of $\Theta$ (for which one can still calculate
either the left or right derivatives) and they are of limited practical consequence.

We next derive the IPA derivatives of $L_T(\theta)$ and $Q_T(\theta)$. Recall that $B(\theta) = |\Phi(\theta)|$, i.e.,
the number of BPs containing at least one overflow period.

Theorem 1 For every $\theta \in \Theta$,

$$L'_T(\theta) = -B(\theta). \tag{3.9}$$

Theorem 2 For every $\theta \in \Theta$,

$$Q'_T(\theta) = \sum_{k \in \Phi(\theta)} \left[\eta_k(\theta) - u_{k,1}(\theta)\right]. \tag{3.10}$$

In simple terms, the contribution of a BP, $B_k$, to the sample derivative $Q'_T(\theta)$ in (3.10)
is the length of the interval defined by the first point at which the buffer becomes full
and the end of the BP. Once again, as in (3.9), observe that the IPA derivative $Q'_T(\theta)$ is
nonparametric, since it requires only the recording of times at which the buffer becomes
full (i.e., $u_{k,1}(\theta)$) and empty (i.e., $\eta_k(\theta)$) for any $B_k$ with $k \in \Phi(\theta)$.
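The two theorems can be checked numerically on a small example. The sketch below is illustrative only: it assumes the usual single-node SFM dynamics (the buffer content changes at rate $\alpha(t) - \beta(t)$, is held at 0 and at $\theta$, and fluid is lost at rate $\alpha(t) - \beta(t)$ while the buffer is full), which are stated earlier in the chapter rather than in this section. The input format (a list of `(duration, net_rate)` segments) and the function names are hypothetical.

```python
def simulate_sfm(segments, theta):
    """Return (L_T, Q_T, IPA dL/dtheta, IPA dQ/dtheta) for one sample path.

    segments: list of (duration, net_rate), net_rate = alpha - beta held
    constant on each segment (hypothetical input format). x(0) = 0.
    """
    x, t = 0.0, 0.0
    loss, work = 0.0, 0.0
    d_loss, d_work = 0, 0.0
    first_full = None  # time the buffer first became full in the current BP
    for dur, rate in segments:
        if rate == 0:
            raise ValueError("net rate 0 is excluded by Assumption 1a")
        if rate > 0:
            to_full = (theta - x) / rate
            if to_full < dur:
                # fill up, then stay full and lose fluid at the net rate
                work += x * to_full + 0.5 * rate * to_full ** 2
                work += theta * (dur - to_full)
                loss += rate * (dur - to_full)
                if first_full is None:
                    first_full = t + to_full
                x = theta
            else:
                work += x * dur + 0.5 * rate * dur ** 2
                x += rate * dur
        else:
            to_empty = x / -rate
            if to_empty <= dur:
                # drain to empty: the busy period (if any) ends here
                work += x * to_empty + 0.5 * rate * to_empty ** 2
                if first_full is not None:
                    d_loss -= 1                                # (3.9)
                    d_work += (t + to_empty) - first_full      # (3.10)
                    first_full = None
                x = 0.0
            else:
                work += x * dur + 0.5 * rate * dur ** 2
                x += rate * dur
        t += dur
    if first_full is not None:  # a BP with overflow is still open at T
        d_loss -= 1
        d_work += t - first_full
    return loss, work, d_loss, d_work


def finite_difference(segments, theta, h=1e-4):
    """Central finite-difference estimates of dL_T/dtheta and dQ_T/dtheta."""
    up = simulate_sfm(segments, theta + h)
    dn = simulate_sfm(segments, theta - h)
    return (up[0] - dn[0]) / (2 * h), (up[1] - dn[1]) / (2 * h)


# Made-up sample path: three busy periods, overflow in the first and third.
segments = [(2.0, 1.0), (1.0, -3.0), (3.0, 0.2), (2.0, -1.0), (2.0, 2.0)]
loss, work, d_loss, d_work = simulate_sfm(segments, theta=1.0)
fd_loss, fd_work = finite_difference(segments, theta=1.0)
print("IPA:", d_loss, d_work)
print("finite difference:", fd_loss, fd_work)
```

On this path two busy periods contain overflow, so the IPA loss derivative is $-2$, and the work derivative is the sum of the two "first full to end of BP" intervals; the finite-difference values should agree to within rounding.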


IPA  Unbiasedness 

We next show the unbiasedness of the IPA derivatives $L'_T(\theta)$ and $Q'_T(\theta)$ obtained above. In
general, the unbiasedness of an IPA derivative $C'(\theta)$ has been shown to be ensured by the
following two conditions (see [71], Lemma A2, p.70):

Condition 1. For every $\theta \in \Theta$, the sample derivative $C'(\theta)$ exists w.p.1.

Condition 2. W.p.1, the random function $C(\theta)$ is Lipschitz continuous throughout $\Theta$,
and the (generally random) Lipschitz constant has a finite first moment.

Consequently, establishing the unbiasedness of $L'_T(\theta)$ and $Q'_T(\theta)$ as estimators of $dE[L_T(\theta)]/d\theta$
and $dE[Q_T(\theta)]/d\theta$, respectively, reduces to verifying the Lipschitz continuity of $L_T(\theta)$ and
$Q_T(\theta)$ with appropriate Lipschitz constants. Recall that $N(T)$ is the random number of all
exogenous events in $[0,T]$ and that we have assumed $E[N(T)] < \infty$.

Theorem 3 Under Assumption 1,

1. If $E[N(T)] < \infty$, then the IPA derivative $L'_T(\theta)$ is an unbiased estimator of $dE[L_T(\theta)]/d\theta$.

2. The IPA derivative $Q'_T(\theta)$ is an unbiased estimator of $dE[Q_T(\theta)]/d\theta$.

Remark. For the more commonly used performance metrics $\frac{1}{T}E[L_T(\theta)]$ (the Expected
Loss Rate over $[0,T]$) and $\frac{1}{T}E[Q_T(\theta)]$ (the Expected Buffer Content over $[0,T]$), the Lipschitz
constants in Theorem 3 become $N(T)/T$ and 1, respectively. As $T \to \infty$, the former
quantity typically converges to the exogenous event rate.

3.3.2  IPA  Estimation  Algorithm 

Algorithm 1

• Initialize a counter $\mathcal{C} := 0$ and a cumulative timer $\mathcal{T} := 0$.

• Initialize $\tau := 0$.

• If an overflow event is observed at time $t$ and $\tau = 0$:

  - Set $\tau := t$.

• If a busy period ends at time $t$ and $\tau > 0$:

  - Set $\mathcal{C} := \mathcal{C} - 1$ and $\mathcal{T} := \mathcal{T} + (t - \tau)$.

  - Reset $\tau := 0$.

• If $t = T$ and $\tau > 0$:

  - Set $\mathcal{C} := \mathcal{C} - 1$ and $\mathcal{T} := \mathcal{T} + (t - \tau)$.


The final values of $\mathcal{C}$ and $\mathcal{T}$ provide the IPA derivatives $L'_T(\theta)$ and $Q'_T(\theta)$, respectively.
We remark that the “overflow” and “end of BP” events are readily observable during actual
network operation. In addition, we point out once again that these estimates are independent
of all underlying stochastic features, including traffic and processing rates. Finally,
the algorithm is easily modified to apply to any interval $[T_1, T_2]$.
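The steps of Algorithm 1 can be sketched as a short event-processing routine. This is a minimal illustration, not the authors' implementation: the event-trace format (time-ordered `(time, kind)` pairs), the event names, and the use of `None` in place of the sentinel value $\tau = 0$ are all assumptions made here.

```python
def ipa_from_events(events, horizon):
    """Apply Algorithm 1 to a time-ordered list of (time, kind) events.

    kind is "overflow" (buffer becomes full) or "bp_end" (busy period ends);
    both names are hypothetical. Returns (counter, timer), the IPA
    estimates of L'_T(theta) and Q'_T(theta).
    """
    counter = 0    # accumulates -1 per busy period with overflow
    timer = 0.0    # accumulates (end of BP) - (first overflow time)
    tau = None     # first overflow time in the current BP; None plays tau = 0
    for t, kind in events:
        if kind == "overflow" and tau is None:
            tau = t
        elif kind == "bp_end" and tau is not None:
            counter -= 1
            timer += t - tau
            tau = None
    if tau is not None:  # a busy period with overflow is still open at T
        counter -= 1
        timer += horizon - tau
    return counter, timer


# Made-up trace with the same shape as Fig. 3.3: three busy periods,
# two overflow periods in the first, none in the second, one in the third.
events = [
    (1.0, "overflow"), (2.5, "overflow"), (4.0, "bp_end"),
    (6.0, "bp_end"),
    (8.5, "overflow"),
]
counter, timer = ipa_from_events(events, horizon=10.0)
print(counter, timer)
```

Only the first overflow in each busy period starts the timer, which is why the second overflow at time 2.5 has no effect; no model parameters enter the computation.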


3.4  Conclusions  and  Future  Work 


Stochastic Fluid Models (SFM) can adequately describe the dynamics of complex discrete
event systems or constitute primary models in their own right. In this chapter, we have
considered single-node SFMs from the standpoint of IPA derivative estimation. In particular,
we have developed IPA estimators for the loss volume and work as functions of the
buffer size, and shown them to be unbiased and nonparametric. The simplicity of the
estimators and their nonparametric property suggest their application to on-line optimization
problems.

The sample derivative analysis holds the promise of considerable extensions to multiple
SFMs as models of actual networks and to multiple flow classes that can be used for
differentiating traffic classes with different Quality-of-Service (QoS) requirements. Ongoing
research has already led to very encouraging results, reported in [17, 16, 15], involving IPA
estimators and associated optimization for flow control purposes in multi-node models.
Chapter 4


NEURAL  NETWORK 
METAMODELING 


Simulation is one of the most powerful tools for modeling and evaluating the performance
of complex systems; however, it is computationally slow. One approach to overcoming this
limitation is to develop a “metamodel”, that is, a “surrogate” model of the original system
that accurately captures the relationships between input and output, yet is computationally
more efficient than simulation. Neural networks (NN) are known
to be good function approximators and thus make good metamodel candidates. During
training, a NN is presented with several input/output pairs, and is expected to learn the
functional relationship between inputs and outputs of the simulation model. A trained
net can then predict the output for inputs other than the ones presented during training. This
ability of NNs to generalize depends on the number of training pairs used. In general, a
large number of such pairs is required and, since they are obtained through simulation, the
metamodel development is slow. When using concurrent simulation, as in Chapter 3, we
can obtain sensitivity information with respect to various input parameters. In this chapter,
we investigate the use of sensitivity information to reduce the simulation effort required for
training a NN metamodel.


4.1  Introduction 


Simulation is arguably the most versatile and general-purpose tool available today for
modeling complex systems such as Discrete Event Systems (DES). It can be used for performance
evaluation, system design, decision making, and planning. Such applications typically
involve the use of simulation to answer a multitude of “what-if” questions under various
scenarios, each corresponding to different parameters, designs or decisions. However,
simulation is notoriously time consuming. For complex systems, performance evaluation under
a single set of input parameters can take several minutes or even hours. As a result, it is
impractical (if at all feasible) to perform any parametric study of system performance,
especially for systems with a large parameter space. Unless substantial speedup of the
performance evaluation process can be achieved, systematic performance studies of most
real-world problems are beyond reach, even with supercomputers.

One alternative for achieving the required speedup is through “metamodeling”. In this
framework, any simulator is viewed as a function that maps any vector of input parameters
$\mathbf{x} = [x_1, \cdots, x_N]$ to a set of $M$ performance metrics of interest $\mathbf{y} = [y_1, \cdots, y_M]$, that is

$$\mathbf{y} = \Phi(\mathbf{x}). \tag{4.1}$$

Since the evaluation of the function $\Phi(\cdot)$ is generally complex and time consuming,
metamodeling seeks a much simpler and computationally more efficient “surrogate model” $\hat{\Phi}(\cdot)$
such that

$$\hat{\Phi}(\mathbf{x}) \approx \Phi(\mathbf{x}) \tag{4.2}$$

for all input parameters of interest. A typical approach for building $\hat{\Phi}(\cdot)$ is to use simulation
to obtain a training set of $Q$ input-output pairs $\{(\mathbf{x}_i, \mathbf{y}_i),\ i = 1, \cdots, Q\}$, and try to determine
a function that captures the input-output relationship that generated the $Q$ samples. Of
course, the expectation is that the surrogate model will be such that (4.2) will hold not
only for the training pairs, but also for any $\mathbf{x}$ in some domain of interest. This ability of
the surrogate model to produce a reasonable response to an input that is not included in
the training set is referred to as generalization.

Several authors have addressed simulation metamodeling. [88] used polynomial fitting to
develop a metamodel for a Tactical Electronic Reconnaissance Simulation Model (TERSM)
that estimates the number of ground-based radar sites detected by a reconnaissance aircraft
as a function of its flying mission. Other approaches have used statistical analysis: [70] used
least squares estimation for non-linear metamodel estimation, while [24] used regression
with Bayesian methods. Neural networks (NN) are generally known as good function
approximators and thus make good candidates for surrogate functions. [45] used
a backpropagation neural net to capture the behavior of a Command and Control (C2)
network. In some of our earlier work [14], we used a Cascade Correlation NN [26] to
generate metamodels for the TERSM mentioned above and for an Aircraft Refueling and
Maintenance System (ARMS).

Neural network metamodels, though versatile, generally require a large number of training
pairs before they acquire good generalization capabilities. This implies that many
simulation runs are necessary to build a metamodel. To address this problem, in our earlier
work [14] we also proposed the use of Concurrent Estimation [21] as a possible way of
collecting more training pairs from a single simulation run. In this work we propose a different
approach where we use sensitivity (derivative) information to train the neural network.

More specifically, it is by now well-documented in the literature that the nature of sample
paths of Discrete Event Systems (DES) can be exploited so as to extract a significant amount
of information, beyond merely an estimate of a performance measure $\Phi(\mathbf{x})$. It has been
shown that observing a sample path under some parameter value $\mathbf{x}$ allows us to efficiently
obtain estimates of derivatives of the form $\partial\Phi/\partial x$ which are in many cases unbiased and
strongly consistent (e.g., see [11, 31, 41] where Infinitesimal Perturbation Analysis (IPA)
and its extensions are described). The question that arises then is how the sensitivity
information, made available by PA, can be used to construct metamodels. In this chapter
we recognize that since (4.2) must hold for all $\mathbf{x}$ in the domain of interest, the partial
derivatives of the two functions with respect to any $x_i$ should also be equal. As a result, we
modify the standard backpropagation algorithm to also use this information as explained
in Section 4.4.

To our knowledge, this approach of using sensitivity information to reduce the training
sample size is new. For the “classification” problem (as opposed to the “function
approximation” problem we investigate in this chapter) several authors have addressed the issue of
determining the minimum training sample size for a neural network to have good
generalization properties. [7] found that for a network with $N$ nodes and $W$ weights, the number
of randomly selected samples required to achieve correct classification for at least $(1 - \frac{\epsilon}{2})$
fraction of the test examples is $m \ge O\left(\frac{W}{\epsilon} \log \frac{N}{\epsilon}\right)$. [59] found a tighter bound assuming
that the training samples are chosen close to the cluster boundaries. The generalization
performance of practical algorithms is also the focal point in [75]. These authors use the
so-called “ill-disposed” algorithm to derive a probability distribution that allows them to
determine a more realistic bound on the sample size as well as the average generalization
error.

This  chapter  is  organized  as  follows.  In  the  next  section  we  present  the  notation  that  we 
will  use  in  the  sequel.  In  Section  4.3  we  briefly  describe  the  standard  backpropagation  neural 
network  and  in  Section  4.4  we  modify  the  algorithm  to  include  any  available  sensitivity 
information.  In  Section  4.5  we  demonstrate  the  potential  advantages  of  the  approach  with 
two  numerical  examples.  Finally  we  close  with  conclusions  and  future  plans  in  Section  4.6. 


4.2  Notation 


Figure  4.1:  3-Layer  neural  network 


For  the  purposes  of  this  chapter,  we  assume  a  3-layer  neural  network  (see  Fig.  4.1)  where 


$M$: Number of units (neurons) in the output layer.

$H$: Number of units in the hidden layer.

$N$: Number of units in the input layer.

$y_k$: Output of the $k$th unit of the output layer, $k = 1, \cdots, M$.

$z_j$: Output of $j$th hidden unit, $j = 0, \cdots, H$ ($z_0 = 1$ corresponds to the bias input of output
layer units).

$x_i$: Input of the $i$th unit, $i = 0, \cdots, N$ ($x_0 = 1$ corresponds to the bias input of hidden layer
units).

$\mathbf{t}_p = [t_k]_p$: Target output given an input vector $\mathbf{x}_p$, where $k = 1, \cdots, M$, $p = 1, \cdots, Q$ and
$Q$ is the number of training pairs.

$d_{ki}$: Sensitivity of $k$th network output with respect to its $i$th input, $d_{ki} = \partial y_k / \partial x_i$.

$\mathbf{S}_p = [s_{ki}]_p$: Target sensitivity (for $d_{ki}$) given an input vector $\mathbf{x}_p$. (For DES, we assume that
this information is obtained through some Perturbation Analysis (PA) technique.)

$f(\cdot)$: Activation function of output layer units.

$g(\cdot)$: Activation function of hidden layer units.

4.3  Backpropagation  Neural  Net 


The activation of the $k$th output unit of a standard, three layer backpropagation neural
network (BPNN) as a function of the input $\mathbf{x} = [x_0, \cdots, x_N]$ is given by:

$$y_k = f\left(y_k^{IN}\right) \tag{4.3}$$

$$y_k^{IN} = \sum_{j=0}^{H} w_{jk} z_j \tag{4.4}$$

$$z_j = g\left(z_j^{IN}\right) \tag{4.5}$$

$$z_j^{IN} = \sum_{i=0}^{N} u_{ij} x_i \tag{4.6}$$

where $w_{jk}$ is the weight from the $j$th hidden unit to the input of the $k$th output unit and
$u_{ij}$ is the weight from the $i$th input to the $j$th hidden unit.
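The forward pass (4.3)-(4.6) can be sketched in a few lines. The weight layout used here (nested lists with the bias weights stored in row 0) and the default choice of a tanh hidden activation with a linear output are assumptions made for illustration; the chapter does not fix them at this point.

```python
import math


def forward(x, u, w, f=lambda v: v, g=math.tanh):
    """Forward pass of a 3-layer network, following (4.3)-(4.6).

    x: inputs [x_1, ..., x_N]
    u: (N+1) x H nested list, u[i][j] = weight from input i to hidden unit j+1
       (row 0 holds the bias weights, since x_0 = 1)
    w: (H+1) x M nested list, w[j][k] = weight from hidden unit j to output k+1
       (row 0 holds the bias weights, since z_0 = 1)
    """
    xb = [1.0] + list(x)                                   # prepend x_0 = 1
    n_hidden = len(u[0])
    z_in = [sum(u[i][j] * xb[i] for i in range(len(xb)))   # (4.6)
            for j in range(n_hidden)]
    z = [1.0] + [g(v) for v in z_in]                       # (4.5), z_0 = 1
    n_out = len(w[0])
    y_in = [sum(w[j][k] * z[j] for j in range(len(z)))     # (4.4)
            for k in range(n_out)]
    return [f(v) for v in y_in]                            # (4.3)


# Tiny example: N = 1 input, H = 2 hidden units, M = 1 output.
u = [[0.0, 0.5],    # bias weights of the two hidden units
     [1.0, -1.0]]   # weights from x_1
w = [[0.1], [2.0], [3.0]]
y = forward([0.0], u, w)
print(y)
```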

The learning procedure of the backpropagation neural network is based on minimizing
the sum of squared errors. That is, minimize an error function of the form:

$$E = \frac{1}{2} \sum_{k=1}^{M} (t_k - y_k)^2 \tag{4.7}$$

The minimization is done by gradient descent methods, where backpropagation involves the
chain rule to back propagate errors from the network’s outputs to each of the network’s
weights; see [27] for details. Next, we investigate a possible way of utilizing the sensitivity
information that can be obtained through some PA technique.
4.4 Derivative Backpropagation Neural Networks


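If equation (4.2) is supposed to hold for all $\mathbf{x}$ in the domain of interest of $\mathbf{x}$, then it is
reasonable to require that:

$$\frac{\partial \hat{\Phi}(\mathbf{x})}{\partial x_i} \approx \frac{\partial \Phi(\mathbf{x})}{\partial x_i}, \quad i = 1, \cdots, N. \tag{4.8}$$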

In other words, the sensitivity of the neural net output with respect to each one of its inputs
should be approximately equal to the sensitivity of the simulation model with respect to the
same inputs. Though complex, it is possible to determine the neural network’s sensitivity
with respect to its inputs using calculus. Also, for several discrete event systems, the model’s
sensitivity with respect to its input parameters can be calculated using some perturbation
analysis technique. Thus, the main idea behind our approach is to adapt (4.7) to account
not only for the error in the output value, but for the error in the sensitivity as well.
Therefore, the neural net training objective function becomes

$$E = \frac{\alpha}{2} \sum_{k=1}^{M} (t_k - y_k)^2 + \frac{\beta}{2} \sum_{k=1}^{M} \sum_{i=1}^{N} (s_{ki} - d_{ki})^2. \tag{4.9}$$

The first term is the usual error term used in standard backpropagation neural networks.
The second term is the error in the sensitivity of the neural net compared to the sensitivity
of the model. Finally, $0 \le \alpha \le 1$ and $\beta = 1 - \alpha$ are weighting factors that determine the
importance to be associated to the derivative error. Note that if $\beta = 0$, then we get the
standard backpropagation algorithm.

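Next, if we are interested in minimizing the error function of (4.9), we need to determine
the neural network’s sensitivity with respect to its inputs, $d_{ki}$. For the 3-layer network we
consider in this chapter, this is done by the chain rule of differentiation as shown below:

$$\begin{aligned}
d_{ki} = \frac{\partial y_k}{\partial x_i}
&= f'\left(y_k^{IN}\right) \frac{\partial y_k^{IN}}{\partial x_i} \\
&= f'\left(y_k^{IN}\right) \sum_{j=1}^{H} w_{jk} \frac{\partial z_j}{\partial x_i} \\
&= f'\left(y_k^{IN}\right) \sum_{j=1}^{H} w_{jk}\, g'\left(z_j^{IN}\right) \frac{\partial z_j^{IN}}{\partial x_i} \\
&= f'\left(y_k^{IN}\right) \sum_{j=1}^{H} w_{jk}\, g'\left(z_j^{IN}\right) u_{ij}
\end{aligned} \tag{4.10}$$

Subsequently, if we want to use gradient based techniques to minimize the error function
of (4.9) we need $\partial E / \partial w_{jk}$ and $\partial E / \partial u_{ij}$ for all $i, j, k$, which are derived next, through repetitive use
of the chain rule of differentiation.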
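The chain-rule expression (4.10) is easy to verify against a finite-difference approximation. The sketch below assumes a tanh hidden activation (so $g'(z^{IN}) = 1 - z^2$) and a linear output activation (so $f' = 1$); these choices, the nested-list weight layout, and the function name are illustrative assumptions.

```python
import math


def forward_with_sensitivity(x, u, w):
    """Return (y, d) where d[k][i] approximates nothing: it is the exact
    chain-rule sensitivity dy_{k+1}/dx_{i+1} from (4.10).

    Assumes tanh hidden units and linear output units (f' = 1).
    u: (N+1) x H weights, row 0 = hidden biases;
    w: (H+1) x M weights, row 0 = output biases.
    """
    n_in, n_hidden, n_out = len(x), len(u[0]), len(w[0])
    xb = [1.0] + list(x)
    z_in = [sum(u[i][j] * xb[i] for i in range(n_in + 1))
            for j in range(n_hidden)]
    z = [math.tanh(v) for v in z_in]          # z_1 .. z_H
    g_prime = [1.0 - v * v for v in z]        # tanh'(z_in) = 1 - tanh^2
    y = [w[0][k] + sum(w[j + 1][k] * z[j] for j in range(n_hidden))
         for k in range(n_out)]
    # (4.10) with f' = 1: d_ki = sum_j w_jk g'(z_j^IN) u_ij, j = 1..H
    d = [[sum(w[j + 1][k] * g_prime[j] * u[i][j] for j in range(n_hidden))
          for i in range(1, n_in + 1)]
         for k in range(n_out)]
    return y, d


# Arbitrary weights for N = 2 inputs, H = 2 hidden units, M = 1 output.
u = [[0.1, -0.2], [0.5, 0.3], [-0.4, 0.8]]
w = [[0.05], [1.5], [-0.7]]
x = [0.3, -0.7]
y, d = forward_with_sensitivity(x, u, w)

# Central finite differences for comparison.
h = 1e-6
fd = []
for i in range(len(x)):
    xp, xm = list(x), list(x)
    xp[i] += h
    xm[i] -= h
    fd.append((forward_with_sensitivity(xp, u, w)[0][0]
               - forward_with_sensitivity(xm, u, w)[0][0]) / (2 * h))
print(d[0], fd)
```

The bias unit $z_0 = 1$ does not depend on the inputs, which is why the sum in (4.10) starts at $j = 1$.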
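$$\begin{aligned}
\frac{\partial E}{\partial w_{jK}}
&= -\alpha (t_K - y_K) \frac{\partial y_K}{\partial w_{jK}} - \beta \sum_{i=1}^{N} \epsilon_{Ki} \frac{\partial d_{Ki}}{\partial w_{jK}} \\
&= -\alpha (t_K - y_K) f'\left(y_K^{IN}\right) z_j
 - \beta \sum_{i=1}^{N} \epsilon_{Ki} \left[ f''\left(y_K^{IN}\right) z_j S(K,i) + f'\left(y_K^{IN}\right) g'\left(z_j^{IN}\right) u_{ij} \right]
\end{aligned} \tag{4.11}$$

where

$$\epsilon_{ki} = (s_{ki} - d_{ki})$$

and

$$S(k,i) = \sum_{j=1}^{H} w_{jk}\, g'\left(z_j^{IN}\right) u_{ij}.$$

Similarly,

$$\begin{aligned}
\frac{\partial E}{\partial u_{IJ}}
= {} & -\alpha \sum_{k=1}^{M} (t_k - y_k) f'\left(y_k^{IN}\right) w_{Jk}\, g'\left(z_J^{IN}\right) x_I \\
& - \beta \sum_{k=1}^{M} \sum_{i=1}^{N} \Big[ \epsilon_{ki} \Big( f''\left(y_k^{IN}\right) w_{Jk}\, g'\left(z_J^{IN}\right) x_I\, S(k,i) \\
& \qquad + f'\left(y_k^{IN}\right) w_{Jk} \left( g''\left(z_J^{IN}\right) x_I u_{iJ} + g'\left(z_J^{IN}\right) \delta_{iI} \right) \Big) \Big]
\end{aligned} \tag{4.12}$$

where $\delta_{iI} = 1$ if $i = I$ and 0 otherwise.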


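Note that these expressions get considerably simpler when the activation function of the
output layer units is linear. In this case, $f(x) = x$, $f'(x) = 1$, and $f''(x) = 0$. Therefore,

$$\frac{\partial E}{\partial w_{jK}} = -\alpha (t_K - y_K) z_j - \beta\, g'\left(z_j^{IN}\right) \sum_{i=1}^{N} (s_{Ki} - d_{Ki}) u_{ij} \tag{4.13}$$

and

$$\frac{\partial E}{\partial u_{IJ}} = -\alpha \sum_{k=1}^{M} (t_k - y_k) w_{Jk}\, g'\left(z_J^{IN}\right) x_I
 - \beta \sum_{k=1}^{M} \sum_{i=1}^{N} (s_{ki} - d_{ki}) w_{Jk} \left( g''\left(z_J^{IN}\right) x_I u_{iJ} + g'\left(z_J^{IN}\right) \delta_{iI} \right) \tag{4.14}$$

where $\delta_{iI} = 1$ if $i = I$ and 0 otherwise.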


Finally, the weight updates at every iteration $t$ are given by

$$w_{jk}(t+1) = w_{jk}(t) - \gamma \frac{\partial E}{\partial w_{jk}} \tag{4.15}$$

and

$$u_{ij}(t+1) = u_{ij}(t) - \gamma \frac{\partial E}{\partial u_{ij}} \tag{4.16}$$

for all $i = 0, \cdots, N$, $j = 0, \cdots, H$, $k = 1, \cdots, M$, where $\gamma$ is the learning rate and $\partial E / \partial w_{jk}$ and
$\partial E / \partial u_{ij}$ are given by (4.11) and (4.12) respectively (note that $i = 0$ corresponds to the bias input
of a neuron). In the sequel, this will be referred to as the Derivative Backpropagation Neural
Network (DBPNN). At this point, it is worth pointing out that apart from the learning rate
$\gamma$ one needs to determine the weight to be given to the derivative error, $\beta = 1 - \alpha$. This is
an important factor that may affect the convergence of the algorithm as discussed later in
the chapter.
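As a concrete illustration of the update rule, the sketch below implements the derivative-augmented gradient for the simplest case: one input, one linear output unit, and tanh hidden units. The parameter layout (a dictionary of lists), the function names, the initial weights, the learning rate, and the rescaled training points on $[-1, 1]$ are all assumptions made here, not the settings used in the experiments reported in this chapter.

```python
import math


def forward(p, x):
    """One input, one linear output, tanh hidden units (assumed setup)."""
    z = [math.tanh(a + b * x) for a, b in zip(p["u0"], p["u1"])]
    gp = [1.0 - v * v for v in z]                     # g' for tanh
    y = p["w0"][0] + sum(w * v for w, v in zip(p["w"], z))
    d = sum(w * g * b for w, g, b in zip(p["w"], gp, p["u1"]))  # (4.10), f' = 1
    return y, d, z, gp


def loss(p, x, t, s, alpha):
    """Objective (4.9) for a single training pair, beta = 1 - alpha."""
    y, d, _, _ = forward(p, x)
    return 0.5 * alpha * (t - y) ** 2 + 0.5 * (1.0 - alpha) * (s - d) ** 2


def grad(p, x, t, s, alpha):
    """Gradient of (4.9) for the linear-output case, per parameter group."""
    beta = 1.0 - alpha
    y, d, z, gp = forward(p, x)
    e, eps = t - y, s - d
    gpp = [-2.0 * v * g for v, g in zip(z, gp)]       # g'' for tanh
    return {
        "w0": [-alpha * e],                           # output bias (z_0 = 1)
        "w": [-alpha * e * v - beta * eps * g * b
              for v, g, b in zip(z, gp, p["u1"])],
        "u0": [-alpha * e * w * g - beta * eps * w * g2 * b
               for w, g, g2, b in zip(p["w"], gp, gpp, p["u1"])],
        "u1": [-alpha * e * w * g * x - beta * eps * w * (g2 * x * b + g)
               for w, g, g2, b in zip(p["w"], gp, gpp, p["u1"])],
    }


def train(p, data, alpha, rate, steps):
    """Batch gradient descent using the updates (4.15)-(4.16)."""
    for _ in range(steps):
        total = {k: [0.0] * len(v) for k, v in p.items()}
        for x, t, s in data:
            g = grad(p, x, t, s, alpha)
            for k in p:
                total[k] = [a + b for a, b in zip(total[k], g[k])]
        for k in p:
            p[k] = [v - rate * gk for v, gk in zip(p[k], total[k])]
    return p


# Three (x, target, target derivative) triples for y = x^2 on [-1, 1].
data = [(-1.0, 1.0, -2.0), (0.0, 0.0, 0.0), (1.0, 1.0, 2.0)]
params = {"u0": [0.1, -0.2, 0.3], "u1": [0.5, -0.4, 0.3],
          "w0": [0.0], "w": [0.3, -0.2, 0.1]}
before = sum(loss(params, x, t, s, 0.9) for x, t, s in data)
train(params, data, alpha=0.9, rate=0.02, steps=300)
after = sum(loss(params, x, t, s, 0.9) for x, t, s in data)
print(before, after)
```

Setting `alpha=1.0` removes the derivative term and recovers standard backpropagation on the same data, which makes side-by-side comparisons straightforward.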


4.5  Numerical  Results 


In this section we present some results that show the benefit of using the DBPNN training
algorithm described in equations (4.15) and (4.16).

In our first experiment we try to approximate the function $y = x^2$ in the interval
$[-10, 10]$. For this experiment, we use a neural network with 20 hidden units. First, we
used just three input-output pairs $\{(-10, 100), (0, 0), (10, 100)\}$ to train a standard
backpropagation neural network (BPNN). Subsequently, we added the derivatives at these three
points and used the information to train a network using DBPNN. For this experiment we
used $\beta = 0.1$. The outputs of the two networks as well as the target output function are
shown in Fig. 4.2. As seen in the figure, the DBPNN approximates the target function much
better than the standard BPNN. Also shown in the figure is the absolute value of the
approximation error of each network for every value of $x$ in $[-10, 10]$, which also demonstrates
the benefit of DBPNN.


Figure 4.2: Approximation of the function $y = x^2$ using NNs (horizontal axis: $x$)

In order to compare the generalization ability of each network, we integrate the area
under the error curve of each network and plot it in Fig. 4.3 as a function of the number of
points used during the training of the two networks. As seen in the figure, DBPNN achieves
much better generalization than the standard backpropagation neural net, especially when the
number of training points is small.

Figure 4.3: Area under the absolute error of the NN approximations for $y = x^2$ (horizontal
axis: number of training points)

Next we consider an M/M/1 queueing network where we are interested in the average
time that customers spend in the system, $S$, as a function of the traffic intensity $\rho$. For
this system, the IPA algorithm for determining $dS/d\rho$ is given in [11]. Fig. 4.4 shows the
approximations generated by BPNN and DBPNN when both networks have 20 hidden units
and are trained with only 5 training points, with $\beta = 0.01$ for DBPNN. As seen in the figure,
DBPNN again achieves a much better generalization than the standard backpropagation
network.

Finally, Fig. 4.5 shows the area under the generalization error for the two networks.
Again, DBPNN achieves a much better generalization than BPNN, especially for a small
number of training points.

Note that for the second experiment we have set the $\beta$ parameter to a very small value
($\beta = 0.01$). The reason is that as $\rho$ approaches 1, the system time goes asymptotically
to infinity and therefore the derivative at this point becomes very large. As a result, the
derivative error for this point dominates the entire error function, so the neural network, in
its effort to minimize the total error, approximates only that derivative well and does not
approximate the remaining points well. Furthermore, we point out that other factors may
also play a role in the value of the $\beta$ parameter. For example, if we have noisy estimates of
the derivatives, it may be preferable to also set $\beta$ to a small value in order to prevent noise
from taking over the output of the neural network. More research is required to determine
how $\beta$ should be set.
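The size of the effect can be seen from the standard closed form for the M/M/1 mean system time. The expressions below assume a fixed service rate $\mu$, so that $\rho = \lambda/\mu$; the symbols $\mu$ and $\lambda$ are introduced here for illustration and do not appear in the text above.

```latex
S(\rho) = \frac{1}{\mu (1 - \rho)}, \qquad
\frac{dS}{d\rho} = \frac{1}{\mu (1 - \rho)^2}, \qquad
\frac{dS/d\rho}{S(\rho)} = \frac{1}{1 - \rho}.
```

With $\mu = 1$, for instance, $S$ grows from 2 at $\rho = 0.5$ to 10 at $\rho = 0.9$, while $dS/d\rho$ grows from 4 to 100. Since both error terms in (4.9) are squared, a training point close to $\rho = 1$ can contribute far more through the derivative term than through the output term unless $\beta$ is small.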
Figure 4.4: Approximation of the average system time in an M/M/1 queueing system

4.6  Summary 


When dealing with complex systems, simulation is usually the only alternative for
performance evaluation; however, it is notoriously slow, hence the need for “metamodels”. Neural
networks are considered good function approximators and thus make good metamodel
candidates. However, if a neural net is to adequately learn the functional relationship between the
inputs and outputs of a simulation model, it requires a significant number of input/output
pairs. Since such information can only be obtained through simulation, the
training phase of the neural network will be long. In this chapter, we investigated the use
of sensitivity information in the training of a backpropagation neural network. Some
preliminary results indicate that the use of sensitivity information can significantly reduce the
number of training input/output pairs required, which in turn implies that the metamodel
development phase will be expedited.
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Figure  4.5:  Area  under  the  absolute  error  of  the  NN  approximations  for  the  average  system 
time  in  an  M/M/1  System 
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Chapter  5 


HIERARCHICAL 
DECOMPOSITIONS  AND  THE 
CLUSTERING  APPROACH 


In this chapter, we discuss the use of clustering methods in hierarchical simulation of complex
systems and present an application to a computer security problem. First, we discuss
the basic concepts of multi-resolution simulation modelling of complex stochastic systems.
We argue that, to preserve statistical fidelity, high-resolution output data should be
classified into groups that match underlying patterns or features of the system behavior
before group averages are sent to the low-resolution modules. We propose high-dimensional
data clustering as a key interfacing component between simulation modules with different
resolutions and use unsupervised learning schemes to recover the patterns in the
high-resolution simulation results. Subsequently, we give an example of using a Hidden
Markov Model (HMM) as an effective clustering tool for this task. Next, we apply the
clustering techniques in the context of computer security and give some examples of using
HMMs for system modelling in anomaly detection.


5.1  Introduction 


In modelling complex systems it is impossible to mimic every detail through simulation.
The common approach is to divide the whole system hierarchically into simpler modules,
each with a different simulation resolution. In this context, the output of one module becomes
the input parameters of another, as illustrated in Fig. 5.1. The decomposed modules can
be high-resolution or low-resolution models. High-resolution modules, e.g., the usual
discrete-event simulation models, take detailed account of all possible events, but are
generally time consuming. Low-resolution modules perform an aggregate evaluation of the
module functionality, i.e., they determine what would happen “on the average”. Such modules
are less time consuming and can be any of the following components: differential equations,
standard discrete-event simulation, and fluid simulation. Furthermore, a decomposed module
can also be an optimization or decision support tool.


Figure  5.1:  Decomposition  of  complex  systems 

In a hierarchical setting, the lower level simulator (typically with a high resolution)
generates output data which are then taken as input for the higher level simulator (typically
a low-resolution model). Hierarchical simulation is a common practice, but the design of
the hierarchy is always ad hoc. A popular practice is to use the mean values of variables from
the lower level output as the input to the higher level. As a consequence, significant
statistical information can be lost in this process, resulting in potentially completely
inaccurate results. Especially when the ultimate output of the simulation process is of the
form 0 or 1 (e.g., “lose” or “win” a combat), such errors can yield the exact opposite of the
real output.

Quite often, the system being simulated is such that the high-resolution model produces
outputs so widely divergent that it does not make sense to summarize them through
a single average over the entire sample space. In such cases, we must subdivide the sample
space into segments, and have the high-resolution model produce an appropriate input to
the low-resolution model for each such segment. Essentially, the low-resolution model is
broken down into a number of distinct components, one for each segment of the sample
space. To carry out such a segmentation, the high-resolution paths first need to be grouped
by their common features. These features then determine and feed the corresponding
low-resolution model. This grouping procedure is essentially an unsupervised classification
procedure based on some similarity measure. This inspired us to use high-dimensional
clustering techniques to group the high-resolution sample paths into meaningful clusters,
which are then passed on to the lower resolution modules with the statistical fidelity preserved.

5.1.1 Design of the Interface


A systematic design and analysis framework for multi-resolution complex systems is
definitely needed. Here we present some fundamental components of such a framework. Our
effort is directed at developing an interface between two simulation levels that preserves
statistical fidelity to the maximum extent that available computing power allows. In a typical
hierarchical simulation model, the lower level consists of a high-resolution model, such as
a discrete-event simulator, that generates several sample paths given some input
parameters $u$. The output of such a simulation model is then used as input to the higher level
model (typically a low-resolution model). The question is how much and what information
we need to pass from the high-resolution to the low-resolution model so that statistical
fidelity is preserved.

Note that each sample path generated by the high-resolution model is also a function
of some randomness $\omega$ (a random number sequence generated through some random
seed). Thus, any function evaluated over an observed sample path, e.g., $h(u,\omega)$, is also a
random variable. Typically, we are not interested in the value of $h(u,\omega)$ obtained from
a single sample path but rather in the expectation $E\{h(u,\omega)\}$. Based on this, in
hierarchical simulation it is customary to use $E\{h(u,\omega)\}$ as an input parameter to the higher
level model, as seen in Fig. 5.2. This is often highly unsatisfactory, since the mean often
obscures important features of the high-resolution output. Said another way, we are
seeking $E\{L(h(u,\omega))\}$, where $L(\cdot)$ is a function corresponding to the low-resolution model,
but what we end up evaluating by passing a single average is $L(E\{h(u,\omega)\})$; however, in
general $E\{L(h(u,\omega))\} \neq L(E\{h(u,\omega)\})$.
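
As a simple illustration of this inequality (with hypothetical numbers that are not taken
from any experiment in this report), suppose the high-resolution output $h(u,\omega)$ takes
the value 2 with probability 0.6 and the value 9 with probability 0.4, and that the
low-resolution model is the threshold function $L(a) = 1$ if $a \geq 6$ and $L(a) = 0$
otherwise. Then:

```latex
\begin{align*}
E\{h(u,\omega)\}    &= 0.6 \cdot 2 + 0.4 \cdot 9 = 4.8, \\
L(E\{h(u,\omega)\}) &= L(4.8) = 0, \\
E\{L(h(u,\omega))\} &= 0.6 \cdot L(2) + 0.4 \cdot L(9) = 0.4 .
\end{align*}
```

Passing the single average predicts a certain “lose”, even though 40% of the scenarios
in this example are “wins”.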


[Figure 5.2 diagram: the high-resolution model $h(u,\omega)$, with input $u$ and randomness $\omega$, produces $N$ simulated scenarios $h(u,\omega_1),\ldots,h(u,\omega_N)$; the interface averages them into $a = E\{h(u,\omega)\}$; the low-resolution model receives the parameter $a$ and outputs $x = L(a)$.]

Figure 5.2: Hierarchical model interface: passing a simple average to the lower resolution
model

To solve this problem we propose the use of clustering to identify groups of sample
paths that have some “common features” and therefore, when averaged together, do not
cause the loss of too much information. This approach is shown in Fig. 5.3. From the $N$
observed sample paths we identify $m < N$ groups that share some common features and
determine $m$ input parameters $a_1,\ldots,a_m$, where $a_i = E\{h^i(u,\omega)\}$ and $h^i(\cdot)$ identifies all
sample paths in cluster $i$. Subsequently, each parameter $a_i$ is used as an input to a lower
resolution model and finally we obtain $E\{L(a_i)\}$ over the $m$ low-resolution components, which
we claim is a better estimate of the overall system output than the one obtained using a
single average.
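
The two interfaces can be compared with a minimal Python sketch. The helper names, the
sample values and the threshold model are hypothetical, and weighting each $L(a_i)$ by the
relative size of its cluster is an assumption about how the expectation over the $m$
components is formed; the cluster labels would come from whichever clustering tool is used.

```python
from collections import defaultdict


def single_average_interface(outputs, low_res_model):
    """Pass one overall average to the low-resolution model: L(E{h})."""
    return low_res_model(sum(outputs) / len(outputs))


def clustered_interface(outputs, labels, low_res_model):
    """Pass one average per cluster and combine the m low-resolution results.

    outputs: high-resolution results, one per sample path
    labels:  cluster label of each sample path
    The results L(a_i) are weighted by relative cluster size (an assumption).
    """
    if len(outputs) != len(labels):
        raise ValueError("outputs and labels must have the same length")
    groups = defaultdict(list)
    for value, label in zip(outputs, labels):
        groups[label].append(value)
    estimate = 0.0
    for members in groups.values():
        a_i = sum(members) / len(members)        # cluster average a_i
        weight = len(members) / len(outputs)     # empirical cluster probability
        estimate += weight * low_res_model(a_i)
    return estimate


def win_or_lose(a):
    """Hypothetical 0/1 low-resolution model: win if the input is at least 6."""
    return 1.0 if a >= 6.0 else 0.0


if __name__ == "__main__":
    outputs = [2.0, 2.0, 2.0, 9.0, 9.0]   # two clearly distinct regimes
    labels = [0, 0, 0, 1, 1]
    print(single_average_interface(outputs, win_or_lose))
    print(clustered_interface(outputs, labels, win_or_lose))
```

With these numbers the single average (4.8) falls below the threshold and hides the
winning regime, whereas the clustered interface retains it.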


Figure  5.3:  Hierarchical  model  interface:  passing  several  averages  to  the  lower  resolution 
model,  one  for  each  cluster 

One may pose the following question: since the desired output is of the form $E\{L(h(u,\omega))\}$,
why bother with clustering at all when we can evaluate $L(h(u,\omega_n))$ for each of the $N$ obtained
samples and then perform the required expectation, especially since the low-resolution
models are generally easy to evaluate? The answer to this question lies in the derivation of the
low-resolution model. Typically, $L(a)$ assumes that $a$ is an expectation and therefore it
would be meaningless to use some quantity obtained from a single sample path.


5.1.2  Clustering  Tools 

Based on the above design principles for the interface of the multi-resolution simulation
framework, we have proposed two different types of clustering tools: ART (Adaptive Resonance
Theory), based on neural networks, and HMMs, based on stochastic dynamics.


ART neural networks were developed by Carpenter and Grossberg [10] to understand
the clustering function of the human visual system. They are based on a competitive
learning scheme and are designed to deal with the stability/plasticity dilemma in clustering
and general learning. ART neural networks resolve this dilemma by matching
the input pattern with the prototypes. If the matching is not adequate, a new prototype
is created. In this way, previously learned memories are not eroded by new learning. In
addition, the ART neural network implements a feedback mechanism during learning to
enhance stability. Our experiments [23, 13] using ART neural networks with combat
simulation paths have been quite successful. We believe further improvement of the
ART structure can lead to a fundamental breakthrough in large data clustering, which is
needed in complex systems modelling. We have also proposed a heuristic that allows the
magnitude of the input pattern to play a role in the clustering function. Furthermore, we
are developing a generic numerical clustering tool, based on the ART neural network, that
can be used for many important problems in intelligent data analysis.

In general, the description of a typical sample path generated by a discrete-event system
requires a large amount of data since such sample paths are typically quite long. This implies
that the dimension of each input pattern will also be large. However, for high-dimensional
data most clustering algorithms (including ART) involve a huge computational
effort and are thus not practical for simulation modelling purposes. For this reason we
develop a new clustering approach that takes advantage of the statistical structure
behind a typical sample path. For high-dimensional complex systems we use a Hidden
Markov Model (HMM), which has been successfully used in speech recognition and other
areas [69], to characterize each observed sample path. In our approach, we use an HMM
to describe an arbitrary sample path and then cluster together all sample paths whose
corresponding HMMs have a high similarity measure (to be defined in Section 5.2). The
advantage of this approach is that the amount of data required to describe an HMM is
generally much smaller than the amount of data required to explicitly describe an observed
sample path; as a result, the HMM approach is more efficient.


5.2  Hidden  Markov  Model  as  a  Clustering  Tool 

A sample path generated by a discrete-event system consists of a sequence $(e_k, t_k)$,
$k = 1,2,\ldots,K$, where $e_k$ denotes the $k$th event and $t_k$ its corresponding occurrence time. For
typical systems, the number of observed events $K$ is very large and thus attempting to
cluster sample paths “directly”, i.e., by making explicit use of the entire event sequence
$(e_k, t_k)$, requires input vectors of a very large dimension, which has an adverse effect
on the computational requirements of most clustering tools (including ART). To address
this problem, we observe a sample path generated by an arbitrary system and try to
describe it by some Markov chain, using the theory of Hidden Markov Models
(HMMs) to identify its parameters. Once we identify the HMM parameters, we define a
similarity measure between each pair of obtained HMMs and cluster together the sample paths
with the largest similarity. The advantage of this approach is that the amount of information
required to describe an HMM is considerably less than the amount of information required
to describe a sample path. Even though the identification of the HMM parameters requires
some additional computational overhead, our experiments have shown that, overall, the
HMM approach is considerably faster than direct clustering approaches. Incidentally, we
point out that this approach makes no a priori assumptions about the statistical distribution
of the data to be analyzed.


5.2.1  Experimental  Design 

Next, we demonstrate the HMM clustering approach through an example. For the purposes
of our example, we assume that we have three systems $S_1$, $S_2$ and $S_3$. When simulated,
each system generates sample paths $Q_{ij}$, $i = 1,2,3$, $j = 1,2,\ldots$, where $Q_{ij}$ corresponds to
the $j$th sample path generated by system $S_i$. When clustering sample paths, it is
reasonable to expect that sample paths generated from the same system are grouped in the
same cluster.

In this example, we generate 9 sample paths, 3 from each system, and develop a way
to distinguish between sample paths obtained from different systems. To achieve this, we
first associate an HMM $\lambda_i = (A_i, B_i, \pi_i)$ with each sample path $i = 1,\ldots,9$, where we use
the notation of Rabiner [69]: $A_i$ denotes the state transition probability, $B_i$ denotes the
observation symbol probability at every state and $\pi_i$ denotes the initial state distribution.
We assume that each HMM consists of $N$ states and that for each state we can observe any of
$M$ possible symbols.

To construct the three systems $S_i$, $i = 1,2,3$, we assume that each consists of a Markov
chain with $N_i$ states ($N_1 = 20$, $N_2 = 10$, and $N_3 = 10$) and randomly generate a state
transition probability matrix $P_i = [p^i_{kl}]$, $k,l = 1,2,\ldots,N_i$. Furthermore, we randomly
generate the parameters $\mu^i_k$, $k = 1,2,\ldots,N_i$, so that the exponentially distributed sojourn
time for state $k$ has mean $1/\mu^i_k$. However, not all real systems are memoryless; therefore,
to make our example more interesting, we introduce some “special” states, where
transitions out of such states are not made according to the state transition probability matrix but
rather through the set of rules we describe next.

For $S_1$, we assume that states $1,\ldots,5$ are special states defined by the following rules:


•  State 1: System stays at state 1 for either 1 or 3 consecutive, exponentially distributed
sojourn time intervals, each with mean $1/\mu^1_1$. Then it jumps to state 10.

•  State 2: System stays at state 2 for a deterministic sojourn time interval of length $2/\mu^1_2$.
Then it jumps to state $n$ according to the state visited before arriving at state 2,
denoted by $S_{-1}$:

$$n = \begin{cases} 15 & \text{if } S_{-1} \in \{1,\ldots,10\} \\ 4 & \text{if } S_{-1} \in \{11,\ldots,20\} \end{cases} \qquad (5.1)$$

•  State 3: System stays at state 3 for either 3 or 5 consecutive sojourn time intervals,
each exponentially distributed with mean $1/\mu^1_3$. Then it transfers according to the state
transition probability matrix $P_1$.

•  State 4: System stays at state 4 for either 4 or 1 exponentially distributed sojourn time
intervals with mean $1/\mu^1_4$. Then it jumps to state $n$ according to the state previously
visited, $S_{-1}$:

$$n = \begin{cases} 6 & \text{if } S_{-1} \in \{1,\ldots,5\} \\ 11 & \text{if } S_{-1} \in \{6,\ldots,10\} \\ 16 & \text{if } S_{-1} \in \{11,\ldots,15\} \\ 1 & \text{if } S_{-1} \in \{16,\ldots,20\} \end{cases} \qquad (5.2)$$

•  State 5: System stays at state 5 for a deterministic sojourn interval of length $1/\mu^1_5$,
and then transfers according to the state transition probability matrix $P_1$.


Both $S_2$ and $S_3$ have only 2 special states defined by the following rules:


•  State 1: System stays at state 1 for 1 or 2 sojourn time intervals, each of which is
exponentially distributed with mean $1/\mu^i_1$, $i = 2,3$. Then it transfers to state $n$ according to
the previous state visited:

$$n = \begin{cases} 7 & \text{if } S_{-1} \in \{1,2\} \\ 9 & \text{if } S_{-1} \in \{3\} \\ 5 & \text{if } S_{-1} \in \{4,\ldots,10\} \end{cases} \qquad (5.3)$$

•  State 2: System stays at state 2 for a deterministic amount of time equal to $1/\mu^i_2$, $i =
2,3$, and then transfers to state 5.
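
The rules for $S_2$ and $S_3$ can be turned into a small sample-path generator. The Python
sketch below is only an illustration under stated assumptions: the function names are
hypothetical, the choice between 1 and 2 sojourn intervals in state 1 is made with equal
probability, each sojourn interval is recorded as a separate observation, the path starts in
the non-special state 4, and the mean holding times are drawn uniformly between 4 and 50.

```python
import random


def make_system(n_states, rng):
    """Random transition matrix P (rows sum to 1) and mean holding times 1/mu_k."""
    P = []
    for _ in range(n_states):
        row = [rng.random() for _ in range(n_states)]
        total = sum(row)
        P.append([x / total for x in row])
    mean_hold = [rng.uniform(4.0, 50.0) for _ in range(n_states)]
    return P, mean_hold


def next_after_state1(prev):
    """Destination after special state 1, given the previously visited state (5.3)."""
    if prev in (1, 2):
        return 7
    if prev == 3:
        return 9
    return 5


def simulate(P, mean_hold, n_events, rng, start=4):
    """Return n_events pairs (state, holding_time); states are numbered from 1."""
    path = []
    prev, state = start, start
    while len(path) < n_events:
        if state == 1:
            # Special state 1: one or two exponential sojourn intervals (assumed 50/50).
            for _ in range(rng.choice((1, 2))):
                path.append((1, rng.expovariate(1.0 / mean_hold[0])))
            nxt = next_after_state1(prev)
        elif state == 2:
            # Special state 2: deterministic holding time, then jump to state 5.
            path.append((2, mean_hold[1]))
            nxt = 5
        else:
            # Ordinary state: exponential holding time, Markovian transition.
            path.append((state, rng.expovariate(1.0 / mean_hold[state - 1])))
            nxt = rng.choices(range(1, len(P) + 1), weights=P[state - 1])[0]
        prev, state = state, nxt
    return path[:n_events]


if __name__ == "__main__":
    rng = random.Random(0)
    P, mean_hold = make_system(10, rng)
    for state, hold in simulate(P, mean_hold, 5, rng):
        print(state, round(hold, 2))
```

Only the holding times in the second component of each pair would be shown to the HMM,
since the visited states are treated as unobservable.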


Using $S_1$, $S_2$ and $S_3$, we generate 9 sample paths (3 from each system) and for each
sample path we estimate the HMM parameters $\lambda_j$, $j = 1,\ldots,9$, that maximize the
probability that the $j$th observed sample path was obtained from $\lambda_j$. This is referred to as
the training problem and is tackled by repeatedly solving what is described as “Problem 3”
in Rabiner [69]. For the purposes of this experiment, we assume that each HMM consists of
$N = 6$ states. In addition, we assume that the actual state visited by each of the systems
is not observable. Rather, the observation symbols at each state are the state holding
times. These can generally take any positive value. To determine $B_i$, the symbol output
probability, we quantize all possible values into $M = 64$ intervals.
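
The quantization step and the resulting data reduction can be sketched as follows. The
text does not state how the 64 intervals are chosen, so the equal-width scheme below, the
function names and the optional `upper` bound are assumptions made for illustration only.

```python
def quantize_holding_times(holding_times, n_symbols=64, upper=None):
    """Map positive holding times to integer symbols 0 .. n_symbols - 1.

    Uses equal-width intervals on [0, upper] (an assumed scheme); values at or
    above `upper` fall into the last interval.
    """
    if n_symbols < 1:
        raise ValueError("n_symbols must be positive")
    if upper is None:
        upper = max(holding_times)
    if upper <= 0.0:
        raise ValueError("upper must be positive")
    width = upper / n_symbols
    return [min(int(t / width), n_symbols - 1) for t in holding_times]


def hmm_parameter_count(n_states, n_symbols):
    """Number of entries in (A, B, pi): N*N + N*M + N."""
    return n_states * n_states + n_states * n_symbols + n_states


if __name__ == "__main__":
    print(quantize_holding_times([0.5, 1.0, 32.0, 64.0, 100.0], 64, upper=64.0))
    print(hmm_parameter_count(6, 64))
```

With $N = 6$ and $M = 64$ the three HMM arrays hold 426 entries in total, compared with
one holding time per event for the sample path itself.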

Once we determine the HMM parameters for all sample paths, $\lambda_i$, $i = 1,\ldots,9$, we use
the similarity measure defined in Rabiner [69] to determine which HMMs, and consequently
which sample paths, are sufficiently similar so that they can be clustered together. The
similarity measure is defined for any pair of HMMs $\lambda_i$ and $\lambda_j$ as:

$$\sigma(\lambda_i,\lambda_j) = \exp\{D(\lambda_i,\lambda_j)\} \qquad (5.4)$$

where

$$D(\lambda_i,\lambda_j) = \frac{\log \Pr(O^j|\lambda_i) + \log \Pr(O^i|\lambda_j) - \log \Pr(O^i|\lambda_i) - \log \Pr(O^j|\lambda_j)}{2TK} \qquad (5.5)$$

is what Rabiner [69] called the distance measure. $\Pr(O^i|\lambda_j)$ is the probability that the
observation sequence $O^i$, i.e., the sequence of state holding times that corresponds to the sample
path $Q_i$, was generated by HMM $\lambda_j$. For computational convenience, we break any sample
path into $K$ segments of length $T$ and thus compute

$$\log \Pr(O^i|\lambda_j) = \sum_{k=1}^{K} \log \Pr(O^i_k|\lambda_j) \qquad (5.6)$$

where $\Pr(O^i_k|\lambda_j)$ is the probability that the $k$th subsequence of sample path $i$ was generated by
HMM $\lambda_j$. Also, note that the similarity measure is symmetric, that is, $\sigma(\lambda_i,\lambda_j) = \sigma(\lambda_j,\lambda_i)$,
a desired property for a good similarity measure.

        HMM1    HMM2    HMM3    HMM4    HMM5    HMM6    HMM7    HMM8    HMM9
HMM1    1       0.760   0.769   0.776   0.950   0.950   0.798   0.804   0.794
HMM2    0.760   1       0.940   0.949   0.772   0.776   0.837   0.839   0.835
HMM3    0.769   0.940   1       0.947   0.777   0.785   0.850   0.847   0.847
HMM4    0.776   0.949   0.947   1       0.787   0.793   0.847   0.844   0.844
HMM5    0.950   0.772   0.777   0.787   1       0.951   0.799   0.804   0.799
HMM6    0.950   0.776   0.785   0.793   0.951   1       0.815   0.820   0.809
HMM7    0.798   0.837   0.850   0.847   0.799   0.815   1       0.945   0.943
HMM8    0.804   0.839   0.847   0.844   0.804   0.820   0.945   1       0.950
HMM9    0.794   0.835   0.847   0.844   0.799   0.809   0.943   0.950   1

Table 5.1: Similarity measure between HMMs corresponding to each of the 9 sample paths
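
The similarity computation can be sketched in Python as follows. This is a minimal
illustration rather than the implementation used in the experiments: the function names
are hypothetical, each HMM is represented as a tuple of plain lists, the likelihoods are
evaluated with the standard scaled forward algorithm, both observation sequences are
assumed to have the same length $TK$, and all probabilities are assumed strictly positive
so that every logarithm is finite.

```python
import math


def log_likelihood(obs, A, B, pi):
    """log Pr(obs | A, B, pi) for a discrete HMM, using the scaled forward algorithm."""
    n = len(pi)
    alpha = [pi[s] * B[s][obs[0]] for s in range(n)]
    log_p = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate the normalized forward variables one step.
            alpha = [sum(alpha[r] * A[r][s] for r in range(n)) * B[s][obs[t]]
                     for s in range(n)]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")
        log_p += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_p


def segmented_log_likelihood(obs, model, seg_len):
    """Sum of segment log-likelihoods over consecutive segments of length seg_len."""
    A, B, pi = model
    return sum(log_likelihood(obs[k:k + seg_len], A, B, pi)
               for k in range(0, len(obs), seg_len))


def similarity(model_i, model_j, obs_i, obs_j, seg_len):
    """Similarity exp{D} between two HMMs, each trained on its own sequence."""
    if len(obs_i) != len(obs_j):
        raise ValueError("both observation sequences must have the same length")
    d = (segmented_log_likelihood(obs_j, model_i, seg_len)
         + segmented_log_likelihood(obs_i, model_j, seg_len)
         - segmented_log_likelihood(obs_i, model_i, seg_len)
         - segmented_log_likelihood(obs_j, model_j, seg_len)) / (2.0 * len(obs_i))
    return math.exp(d)


if __name__ == "__main__":
    # Two hypothetical 2-state, 2-symbol models and two short symbol sequences.
    m1 = ([[0.9, 0.1], [0.2, 0.8]], [[0.7, 0.3], [0.1, 0.9]], [0.5, 0.5])
    m2 = ([[0.5, 0.5], [0.5, 0.5]], [[0.4, 0.6], [0.6, 0.4]], [0.5, 0.5])
    o1 = [0, 0, 1, 0, 1, 1, 1, 0]
    o2 = [1, 0, 1, 0, 0, 1, 0, 1]
    print(similarity(m1, m2, o1, o2, seg_len=4))
```

By construction the measure equals 1 when a model is compared with itself, and swapping
the two models (together with their sequences) leaves the value unchanged.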


5.2.2 Experimental Results

In the similarity results shown in Table 5.1, the length of each of the 9 sample paths is
10,000 events. In addition, the parameters $\mu^i_j$, $i = 1,2,3$, $j = 1,\ldots,N_i$, are generated such
that the mean sojourn times $1/\mu^i_j$ are uniformly distributed between 4 and 50. Sample paths
$Q_1$, $Q_5$, and $Q_6$ are generated by $S_1$; $Q_2$, $Q_3$, and $Q_4$ are generated by $S_2$; and $Q_7$, $Q_8$,
and $Q_9$ are generated by $S_3$.

Finally, we cluster together all sample paths that correspond to HMMs with similarity
measure greater than a threshold value $V$. Note that $V$ corresponds to the required degree
of similarity for two sample paths to be clustered together. For example, if $V = 0.9$, then
the similarity measures exceeding $V$ are:

Cluster 1: $\sigma(1,5)$, $\sigma(1,6)$, $\sigma(5,6)$

Cluster 2: $\sigma(2,3)$, $\sigma(2,4)$, $\sigma(3,4)$

Cluster 3: $\sigma(7,8)$, $\sigma(7,9)$, $\sigma(8,9)$

The resulting three clusters correspond to the generative models $S_1$, $S_2$ and $S_3$,
respectively. Therefore, the HMM approach has successfully classified all sample paths.
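
The thresholding step can be reproduced from the values in Table 5.1 with a short Python
sketch. Grouping sample paths into connected components of the "similarity exceeds $V$"
relation is an assumption about how the clusters are formed; the function name is
hypothetical.

```python
# Similarity values copied from Table 5.1 (row/column k corresponds to HMM k+1).
SIM = [
    [1.000, 0.760, 0.769, 0.776, 0.950, 0.950, 0.798, 0.804, 0.794],
    [0.760, 1.000, 0.940, 0.949, 0.772, 0.776, 0.837, 0.839, 0.835],
    [0.769, 0.940, 1.000, 0.947, 0.777, 0.785, 0.850, 0.847, 0.847],
    [0.776, 0.949, 0.947, 1.000, 0.787, 0.793, 0.847, 0.844, 0.844],
    [0.950, 0.772, 0.777, 0.787, 1.000, 0.951, 0.799, 0.804, 0.799],
    [0.950, 0.776, 0.785, 0.793, 0.951, 1.000, 0.815, 0.820, 0.809],
    [0.798, 0.837, 0.850, 0.847, 0.799, 0.815, 1.000, 0.945, 0.943],
    [0.804, 0.839, 0.847, 0.844, 0.804, 0.820, 0.945, 1.000, 0.950],
    [0.794, 0.835, 0.847, 0.844, 0.799, 0.809, 0.943, 0.950, 1.000],
]


def threshold_clusters(sim, v):
    """Group items whose pairwise similarity exceeds v (connected components)."""
    n = len(sim)
    seen = set()
    clusters = []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], []
        while stack:
            i = stack.pop()
            members.append(i + 1)          # report 1-based sample path indices
            for j in range(n):
                if j not in seen and sim[i][j] > v:
                    seen.add(j)
                    stack.append(j)
        clusters.append(sorted(members))
    return clusters


if __name__ == "__main__":
    print(threshold_clusters(SIM, 0.9))
```

For $V = 0.9$ this yields the groups {1, 5, 6}, {2, 3, 4} and {7, 8, 9}, matching the three
clusters listed above.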
5.3 Application of Clustering in Computer Security

5.3.1  Motivation 

Computer networks are complex systems which consist of multiple components. The
interactions between humans and computers, between servers and clients, and between hardware
and software all contribute to the complicated hierarchical structure of computer
networks, whether viewed physically or logically. Computer security defense, and intrusion
detection in particular, aims at identifying malicious activities performed by outside
attackers or insider abusers by analyzing the behavior of computer networks via observable
audit events, and it requires the coordination of multiple detection sensors and
components. This naturally leads to the application of the general framework of modelling and
simulation of multi-resolution complex systems to computer security.

Intrusion activities aimed at gaining unauthorized access to system resources and file
systems should leave traces at different levels of the system, but these traces are not
necessarily significant at all levels. For example, the intrusion activity may appear normal at
some levels of monitoring and be distinct at another level. Hierarchical structural models
are able to integrate both local and global information to make more accurate judgments.

We propose to apply clustering methods in system modelling for computer security.
This is based on the observation of a hierarchical structure in both security audit data and
user behavior, in which audit events from multiple sources with multiple resolutions are
collected and analyzed. In this setting, clustering may help in characterizing the hierarchical
structure for a better view and understanding of the system behavior, as it can in other
multi-resolution complex systems.

Clustering is an unsupervised classification procedure, where the input data to the
clustering algorithm are unlabelled, so it is not feasible for us to use it directly. Rather, we
need to integrate domain knowledge as much as possible when clustering is introduced into
the security system modelling.


5.3.2  Hierarchy  in  Computer  Security  Systems 

Hierarchical structures are observed extensively in computer security systems, such as in [3]
and [72]. We believe that both normal and abnormal system behavior should be described
by integrating data from multiple sources at multiple levels.


Multi-resolution  audit  data 

For security monitoring and intrusion detection, audit events of multiple resolutions are
collected from multiple sources as follows:

•  system call traces: collected from the operating system kernel. They are the lowest
level host-based audit data, and describe the execution of programs.

•  network packets: collected from the local network. They are the lowest level
network-based audit data, describing the network traffic into and out of the host.

•  command history data: collected from the shell or the process table of the operating
system. They reside on the level above the system call traces, describing the observable
behavior of users.

•  system events log: a mixture of multi-resolution events, including messages or errors
from applications and/or network protocols.


Multi-level  system  architecture 

Based on the hierarchical structure of audit events, we describe the hierarchical structure
of a multi-level host-based system model. For simplicity, we only consider the setting of a
single host, i.e., we assume that all audit data come from a single monitored host.
The five levels from top to bottom are host, user, operation, command, and system
call. This structure is illustrated in Fig. 5.4, where the operation level (not shown)
overlaps with the user level.

Figure 5.4: The hierarchical structure of computer security models


1.  Host level: The top level in our architecture. In essence, the behavior of a host
is characterized by three aspects: program behavior, user behavior and
network traffic behavior. In practice, we may observe the following aspects of system
behavior:

•  System Configuration: A change of system configuration may be an indication of
an anomaly.

•  File Integrity: A file integrity check provides valuable information on what really
happened on the host.

•  Traffic Density: A dramatic change of network traffic density may suggest anomalous
activities happening on the system.

•  Resource Utilization: For each system resource, such as CPU, memory
and file access, we can periodically record the density of usage. Using these
records, we can construct a probability distribution of short-term and long-term
behaviors for each of the resources.

•  User access statistics: For each user who has access to the host, we can record
the normal usage frequency and time schedule, and then construct a
probability distribution over the user ID.

•  Process statistics: For a given host, the statistics of processes running over a
period of time can be obtained. We can construct a probability distribution over
the process name/group.

•  System call statistics: For a given host, the statistics of system calls invoked
over a period of time can be obtained. For each system call (or each group of system calls),
we can then construct the probability distribution over their name/functionality.

The above features are extracted from different levels of audit data. If we describe
the behavior of a host by them and compute a numerical index, then a significant
deviation of this index over a given period of time may indicate anomalous behavior.

2.  User level: Normally, an intrusion is carried out directly by a user or by a process
automatically run from a user’s computer. The data at this level are user history data
of commands and system calls collected on a single host. Sometimes it is difficult to
distinguish two users who have similar work schedules and similar temporary tasks.
It may therefore be useful to group the users by their main activities, such as a group
for programming, a group for system administration, etc. Users can also be put
into groups according to their proficiency, such as beginning learners, standard users
and innovative experts. This level sometimes interleaves with the operation
level, as we will see below.

3.  Operation level: This level is also called the activity level. It can be considered as
a sublayer of the user level, or as a layer parallel to or overlapping with the user level.
When working on a computer, each user may perform one or more of the following tasks:
programming, system administration, network utilities, entertainment, idle, etc.
Normally, a user’s activities interleave with each other, so it is not easy to recognize
each distinct activity stream from the user’s mixed command trace.

4.  Command level: This level is equivalent to the program/functionality level, which
is above the system call level and is the basic interface between the user and the host.
It has been suggested that profiling functionality or program execution is more effective
than profiling individual users: the smaller number of functionality profiles is much
more stable than users’ profiles.

5.  System call level: This level is also called the module level, and is the lowest level
of system behavior above the machine code. This level is the interface between the
user application and the kernel of the operating system. In a Linux system, there are
nearly 200 system calls. In our analysis, each of them can be treated as a distinct state,
or they can be grouped by their functionality. Statistical or structural models can be
obtained at this level.
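
The host-level statistics in item 1 above (a probability distribution over process names
and a numerical index whose deviation signals an anomaly) can be sketched as follows. The
text does not specify the index, so the total variation distance used here, the function
names and the process names are all assumptions made for illustration.

```python
from collections import Counter


def empirical_distribution(events):
    """Relative frequency of each event name (e.g., process or system call name)."""
    if not events:
        raise ValueError("at least one event is required")
    counts = Counter(events)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


def deviation_index(baseline, current):
    """Total variation distance between two distributions (0 = identical, 1 = disjoint)."""
    names = set(baseline) | set(current)
    return 0.5 * sum(abs(baseline.get(n, 0.0) - current.get(n, 0.0)) for n in names)


if __name__ == "__main__":
    # Hypothetical process names observed in a normal period and in a later period.
    normal = empirical_distribution(["sh", "ls", "ls", "gcc"])
    later = empirical_distribution(["sh", "nc", "nc", "nc"])
    print(deviation_index(normal, later))
```

An index close to 0 indicates that the recent period resembles the normal profile, while
a value close to 1 indicates that almost none of the observed activity matches it.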


With this hierarchical structure in mind, we need clustering for data reduction and
event correlation, in order to obtain representative models for characterizing normal profiles.


5.3.3  Characterizing  the  Hierarchical  Structure 

In this section, we discuss three different types of models used in computer security systems
for the purpose of intrusion detection. We argue that the effectiveness of these models relies
on the design of the interfaces between high-resolution and low-resolution modules.


Modelling  Program  Behavior 

One way to characterize system behavior is to concentrate on program behavior,
because it is abnormal program execution branches or exceptions that ultimately lead to
unauthorized access to, or abuse of, the system.

To  describe  program  behavior,  we  need  to  look  into  lower  level  audit  data,  i.e.,  the 
execution  traces  of  system  calls.  In  this  case,  system  call  traces  are  generated  by  high- 
resolution  sensors,  while  the  representative  models  of  programs  are  of  lower  resolution. 
This  corresponds  to  the  interface  between  “command”  and  “system  call”  level  as  shown  in 
Fig  5.4.  The  task  here  is  to  summarize  the  data  of  system  call  traces  into  representative 
models  of  programs. 

Various modelling schemes have been proposed to handle this problem, including simple
pattern matching (either fixed-length or variable-length), Markov chain models, decision
trees, probabilistic networks, machine learning models, neural networks, Hidden Markov
Models, etc.
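
As a rough illustration of the simplest scheme in this list, fixed-length pattern matching,
the following Python sketch builds a database of length-k windows from normal system call
traces and scores a new trace by the fraction of its windows that do not appear in the
database. The function names, the window length and the toy traces are assumptions made
for illustration only; they are not taken from the schemes cited in this report.

```python
from typing import Iterable, List, Set, Tuple

Window = Tuple[str, ...]


def windows(trace: List[str], k: int) -> Iterable[Window]:
    """Yield every contiguous length-k window of a system call trace."""
    for i in range(len(trace) - k + 1):
        yield tuple(trace[i:i + k])


def build_normal_db(traces: List[List[str]], k: int) -> Set[Window]:
    """Collect all length-k windows seen in the normal (training) traces."""
    db: Set[Window] = set()
    for trace in traces:
        db.update(windows(trace, k))
    return db


def mismatch_rate(trace: List[str], db: Set[Window], k: int) -> float:
    """Fraction of the trace's windows that are absent from the normal database."""
    ws = list(windows(trace, k))
    if not ws:
        return 0.0  # trace shorter than k: nothing to compare
    return sum(w not in db for w in ws) / len(ws)


# Toy traces (hypothetical system call names).
normal_traces = [
    ["open", "read", "read", "write", "close"],
    ["open", "read", "write", "close"],
]
normal_db = build_normal_db(normal_traces, k=3)
```

A trace whose mismatch rate exceeds a chosen threshold would be flagged as anomalous; the
threshold itself has to be tuned on data and is not specified here.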


Modelling  User  Behavior 

Another way to characterize system behavior is to look at the behavior of individual users,
since it is the users who interact with the computer and perform
normal or abnormal activities. This corresponds to the interface between the “user” level and
the “command” level, where the high-resolution view is the command history data and the
low-resolution view is the profile of user behavior.

There  is  rich  information  within  command  history  data,  such  as  the  time  stamp  of 
the  command,  command  name,  switches,  and  arguments.  People  usually  only  look  at  the 
sequence  of  command  names  and  build  models  similar  to  the  modelling  of  program  behavior. 
However,  command  history  data  have  their  own  special  characteristics  which  call  for  distinct 
treatment  rather  than  being  treated  the  same  as  system  call  sequences. 

Dealing with this high-dimensional data set is not easy, especially since its data fields
are of mixed types. One option is to treat each dimension of the information
separately, with a model chosen specifically for that dimension; the decisions
generated for the individual dimensions are then integrated to form the final
decision. Another option is to define an integrated index that incorporates
all of the information into a single vector for training and testing.

Various models have been proposed for modelling user behavior from command history data,
such as pattern matching, machine learning, Markov models, high-order Markov
models, uniqueness models, etc. Some inherent difficulties in this kind of data
limit the performance of these models in intrusion detection, such as the
high randomness of command history data and the concept-drift problem, i.e., the shift
of user behavior over time. A hybrid model that combines statistics-based
components with signature-based components may be a way to obtain
better performance.

Above the “user level”, we may add another level, the “user group level”. We build this level by
clustering user behavior models and examining the commonality among users. User
groups can thus be constructed from users of similar background, task domain, proficiency
and working schedule. Such groups are indeed observed in user command history data.
Here clustering is used to obtain representative models for each user group.

We could also insert another level, the “activity level”, above the “command level”, where
clustering may be used to extract models for different types of user activity. Models
can be built individually for different activities, such as editing documents, checking
email, system configuration and maintenance, entertainment, etc., to investigate
the dynamics underlying the command sequences. These streams of activities may
interleave with each other when viewed through command history data. It remains an open
question whether good representative models can be found for those activities.


Modelling  Network  Traffic  Behavior 

The third view of system behavior is through the behavior of network traffic.
Many network protocols and services run in current computer network settings,
such as http, ftp, telnet, SMTP, SNMP, ICMP, DNS, etc. Without the communication
provided by network protocols, and the vulnerabilities in their design and implementation,
most intrusions could never be carried out. Analyzing the profiles of network traffic data
is especially important in dealing with Distributed Denial of Service (DDoS) attacks, which
have become the most serious threat to Web servers on the Internet.

In this context, the high-resolution data are the network packets, and the low-resolution
models describe the behavior of different network protocols or different network
operation scenarios. This interface is not shown in Fig 5.4, but it is critical to the successful
coordination among different computers on the network in defending against attacks.

Different techniques have been proposed for modelling network traffic data: to describe
the normal profiles of a network, to support IP traceback, to detect stepping-stones for DDoS
attacks, to characterize worm propagation, etc.


5.3.4  Related  Work 

Clustering has been used in user modelling for anomaly detection. In [52], T. Lane proposes
user modelling based on pattern matching on short segments of command history data. In
this model, frequently used short segments of commands extracted from normal data
constitute the normal profiles. For efficiency in both storage and matching, he uses clustering for
data reduction on the patterns stored in the normal database. In [58], J. Marin et al. propose
a hybrid model that profiles user behavior by the relative frequency of command usage stored
in vectors. They use k-means clustering to generate the initial reference vector.

Clustering has been shown to help in both data reduction and model representation.
Clustering may also be used for event correlation, when events coming from multiple sensors
are gathered together for a comprehensive view of what is going on in the system. In the
study presented next, we propose to use a Hidden Markov Model as a clustering tool at the
interface between high-resolution and low-resolution audit data.


5.4  Experiments  on  HMM  for  Anomaly  Detection 

5.4.1  Applications  of  HMM  in  Computer  Security 

We have seen that clustering may help in the modelling of system behavior for computer
security, and that the HMM is a strong candidate for the clustering task. Here we
examine the possible applications of Hidden Markov Models in computer security. Owing
to its strong descriptive power, an HMM can be used for modelling either user behavior or
program behavior. In [81], Hidden Markov Models are used for modelling program behavior
through the execution traces of system call sequences, and their performance is compared with
that of some simpler models. The results show that Hidden Markov Models describe
the behavior of normal programs well in the context of anomaly detection, at the cost
of a heavy computational burden in training. In [51], T. Lane chooses Hidden Markov Models for
profiling normal user behavior represented by Unix shell command history sequences, where
again HMMs are shown to be able to characterize the complicated structure within user
command sequences.

On the other hand, an HMM can be used at multiple levels of modelling, depending on the
resolution of view. For example, it can be built for sample paths that
are mixtures of multiple activities, or for a specific activity that has been
previously extracted from the sample path.

We have performed experiments in which we apply Hidden Markov Models to model the
normal behavior of network traffic logs and use this model as the reference for detecting
anomalous behavior in the network log data. Next we describe some considerations in
model construction and briefly discuss the experiments.


5.4.2  Model  Construction 

A set of experiments was performed on SIAC company security log data using Hidden
Markov Models. The purpose of the experiments was to use Hidden Markov Models to
characterize the normal behavior of the system traffic, represented by audit event “sample
paths”, so that traffic caused by intrusive activities, which diverges substantially from the
normal model, is captured by the clustering procedure.

In the SIAC log data, each line is a log event. One connection (such as an http connection)
can generate one or more log events. Most of the connections (more than 99.8%) are http
connections, with a small percentage of other events, such as ftp, smap, sendmail, etc.

We have made some necessary modifications to the HMM to make it more precise in
describing user behavior through network traffic log data.

1. Determining number of states: A state in our model represents an abstract,
relatively stable status of a computer user, corresponding to the user's main activity
within a given period of time. Viewed from the audit data, each state corresponds to a
different subgroup of events. For example, when the user's current state is “programming”,
the dominant events will be editing/compiling; when the state changes to “surfing the
web”, the dominant events will be related to HTTP.

2.  Two  Critical  states:  We  choose  “login”  as  an  initial  state  of  the  HMM  model.  The 
initial  state  is  the  entry  state  that  records  some  important  information  of  the  user, 
such  as  user  name,  login  time,  source  IP  address,  login  failure  times,  etc.  Among  all 
the  intrusion  actions,  a  large  percentage  of  them  are  to  gain  unauthorized  root  access 
into  a  local  computer  system,  either  from  a  local  user  account  or  directly  from  outside. 
So  we  add  a  state  called  “user-to-root”  state  to  provide  information  about  how  root 
privilege  is  obtained.  In  the  overall  HMM  model  matching  algorithm,  we  put  heavier 
weight  on  the  matching  score  of  this  state.  If  a  user  usually  logs  in  as  a  supervisor, 
then  the  “user-to-root”  state  is  the  same  as  the  initial  state. 

3. Data Segmentation: To partition the audit event traces into discrete parts, we
take into account the fact that different states have different types of “dominant”
commands, and we use a “window” concept in our model. For example, if in the last 20
events the type of dominant events has changed from A to B, then we suspect that
the state has transited from state A to state B. We apply two types of windows: a time
window and an event counter window, which deal with time intervals and event counts
respectively.

4. Feature selection: The simplest approach is to record only the distribution of different
events in a state. For intrusion detection, this is not enough. To make
detection more accurate, we suggest using rule-based detection techniques to
add more features, such as command duration time, state duration time (exponentially
distributed), overall number of commands, number of “hot actions” (e.g., access to system
directories, creation and execution of programs, etc.), number of accesses to “access
control” files (e.g., /etc/passwd, .rhosts), etc.
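
The event counter window of item 3 can be sketched as follows. This is a minimal
illustration and not the implementation used in the experiments: it assumes non-overlapping
windows of 20 events, takes the most frequent event type in a window as the “dominant”
type, and merges consecutive windows that share a dominant type into one segment. The
function name and the event labels are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Segment = Tuple[str, int, int]  # (dominant event type, start index, end index exclusive)


def segment_by_dominant(events: List[str], window: int = 20) -> List[Segment]:
    """Split an event stream into segments with a constant dominant event type."""
    segments: List[Segment] = []
    for start in range(0, len(events), window):
        chunk = events[start:start + window]
        end = start + len(chunk)
        # Most frequent event type inside this window.
        dominant = Counter(chunk).most_common(1)[0][0]
        if segments and segments[-1][0] == dominant:
            # Same dominant type as the previous window: extend that segment.
            segments[-1] = (dominant, segments[-1][1], end)
        else:
            # Dominant type changed: a state transition is suspected here.
            segments.append((dominant, start, end))
    return segments


# Toy stream: mostly editing/compiling first, then mostly http events.
stream = ["edit"] * 15 + ["compile"] * 5 + ["http"] * 18 + ["edit"] * 2 + ["http"] * 20
parts = segment_by_dominant(stream, window=20)
```

A time window could be handled in the same way by grouping events by time stamp
interval instead of by count.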

5.4.3  Model  Training 

From the above discussion, the number of states is pre-determined. Since
we have partitioned the audit data sequences into separate parts, each with its own
“dominant” events, it is easy to tell which state the user transits to. This means that
the states here are almost surely observable, so the state transition probabilities and state
duration times can be calculated easily from the training data.

For each state record, the distribution of commands is obtained from the observed
frequencies. Other parameters, such as command duration time, number of hot
actions, and number of accesses to critical files, are treated as Gaussian distributed. This
approximation is reasonable and simple to implement. It should be noted that since most
parameters in the HMM have a physical meaning, the system manager can set initial values for
these parameters in advance, which makes the training task lighter and more efficient.
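
Because the states are treated as observable, training reduces to counting. The sketch below
shows one way this could be done, under our own assumptions: transition probabilities are
relative frequencies of consecutive state pairs, and each numeric parameter is summarized
by its sample mean and sample standard deviation. The function names and state labels
are illustrative only.

```python
from collections import defaultdict
from statistics import mean, stdev
from typing import Dict, List, Sequence, Tuple


def estimate_transitions(states: Sequence[str]) -> Dict[str, Dict[str, float]]:
    """Estimate P[i][j] as the relative frequency of observed transitions i -> j."""
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for current, following in zip(states, states[1:]):
        counts[current][following] += 1
    probs: Dict[str, Dict[str, float]] = {}
    for current, row in counts.items():
        total = sum(row.values())
        probs[current] = {nxt: c / total for nxt, c in row.items()}
    return probs


def fit_gaussian(values: List[float]) -> Tuple[float, float]:
    """Return (sample mean, sample standard deviation) of one state parameter."""
    return mean(values), stdev(values)


# Toy training sequence of observed states (hypothetical labels).
observed = ["login", "programming", "web", "programming", "login", "programming"]
P = estimate_transitions(observed)
mu, sigma = fit_gaussian([2.0, 4.0, 6.0])  # e.g. number of hot actions per visit
```

Initial values supplied by the system manager could be combined with these estimates,
for example as prior counts; that step is not shown.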


5.4.4  Calculating  Similarity  Measure 


Since our HMM has been modified substantially from the original model, the matching criterion is
not as simple as calculating the probability of the observation sequence given the model.
Instead, we compute a “suspicious score” for the matching process.

The critical states, i.e., the initial state and the “user-to-root” state, have higher weights, while
the other states have lower weights. Each state has its own suspicious score, computed by
integrating the differences between each parameter and its observed value. We then
compare the suspicious score of the user with a threshold level, which is
set to trade off false alarms against missed detections. Each parameter in the record of a state is
treated as a Gaussian distributed random variable.


Let $J$ be the “suspicious score” of an HMM. The model has $K$ states. During
preprocessing, the user's audit event sequence is divided into $N$ parts, so there are
$N - 1$ state transitions. We calculate the “suspicious score” by:

$$J = W_1 S_1^1 + \sum_{k=2}^{N} \frac{1}{P_{i,j}} W_j S_j^k \qquad (5.7)$$

where $P_{i,j}$ is the probability of state transition from state $i$ to state $j$, $W_j$ is the weight factor
of state $j$ (the “user-to-root” state and the initial state have larger $W_j$), and $S_j^k$ is the suspicious
score of state $j$ compared with the $k$th observed state. In the $k$th term, $i$ and $j$ are the
$(k-1)$th and $k$th observed states. For example, if a user's observed
action sequence is $S_1 S_2 S_4 S_3 S_2$, then the suspicious score is:

$$J = W_1 S_1^1 + \frac{1}{P_{1,2}} W_2 S_2^2 + \frac{1}{P_{2,4}} W_4 S_4^3 + \frac{1}{P_{4,3}} W_3 S_3^4 + \frac{1}{P_{3,2}} W_2 S_2^5 \qquad (5.8)$$

Suppose there are $M$ parameters in state $i$, where each parameter $X_j$, $j = 1, \ldots, M$, has
expectation $\xi_j$ and variance $\sigma_j^2$, and the observed value of the parameter is $O_j$. Then $S_i^k$ can be
calculated as:

$$S_i^k = \sum_{j=1}^{M} \omega_j \left( \frac{\xi_j - O_j}{\sigma_j} \right)^2 \qquad (5.9)$$

where $\omega_j$ is the weight factor of parameter $X_j$. The weight factors $W_j$ and $\omega_j$ can be set by
the system manager and modified through training.
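
The score computation of (5.7)-(5.9) can be sketched in a few lines. The sketch rests on
two assumptions of ours that should be checked against the original equations: the
per-state score is taken to be a weighted sum of squared standardized deviations, and each
transition term is divided by the transition probability, so that rare transitions raise the
score. The floor on the probability, which avoids division by zero for transitions never
seen in training, is also our addition, as are all names below.

```python
from typing import Dict, List, Sequence, Tuple


def state_score(observed: Sequence[float], means: Sequence[float],
                stds: Sequence[float], weights: Sequence[float]) -> float:
    """Assumed form of (5.9): weighted sum of squared standardized deviations."""
    return sum(w * ((m - o) / s) ** 2
               for o, m, s, w in zip(observed, means, stds, weights))


def suspicious_score(path: List[int], scores: List[float],
                     trans_prob: Dict[Tuple[int, int], float],
                     state_weight: Dict[int, float],
                     floor: float = 1e-6) -> float:
    """Assumed form of (5.7).

    path[k]   : state visited in the k-th part of the audit sequence
    scores[k] : state score computed for the k-th part
    """
    total = state_weight[path[0]] * scores[0]  # initial state term
    for k in range(1, len(path)):
        i, j = path[k - 1], path[k]
        # Unseen transitions get the floor probability (our assumption).
        p = max(trans_prob.get((i, j), 0.0), floor)
        total += state_weight[j] * scores[k] / p
    return total


# Worked example for the observed sequence S1 S2 S4 S3 S2 with made-up numbers.
W = {1: 2.0, 2: 1.0, 3: 1.0, 4: 1.0}
P = {(1, 2): 0.5, (2, 4): 0.25, (4, 3): 0.5, (3, 2): 1.0}
J = suspicious_score([1, 2, 4, 3, 2], [1.0] * 5, P, W)
```

The resulting value of J would then be compared with the threshold level discussed above.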


5.4.5  Discussion 

The SIAC log data contain two parts: normal and abnormal. The abnormal part
contains several kinds of intrusion attempts. Our task is to locate these intrusion
attempts in the abnormal part, using the normal data as training data. This set of experiments
was not successful. Here we discuss why the HMM failed in modelling the SIAC data.

1. Insufficient information: In preprocessing, too much information was ignored.
The SIAC data consist mainly of two parts, http connections and non-http
connections. In the normal data, the non-http connections constitute less than 0.2%, so this
part of the data is too small for statistical approximation in preprocessing.
But the most likely intrusions here are in non-http connections, so dealing only with
http connections is not enough.

On the other hand, even for http connections, too much information was ignored
because of the potential burden imposed by the complexity of the information format.
Furthermore, if this set of log data itself does not contain enough information to
discriminate between normal and abnormal behavior, there is no way to find
the intrusions.

2. Difficulty in quantization: The preprocessed data of SIAC http connections are
vectors of sequential data, but different terms in the vector have totally different
numerical domains. The last three attributes take values in {1, 2, 3}, while the
first four attributes vary from 0 to $10^6$. The vector elements must be quantized into the
same domain, which is difficult to do without a suitable Vector Quantization
method.

3. Common problems with statistics-based anomaly detection: For statistics-based
anomaly detection to be successful, some basic assumptions must hold. First, it
requires enough normal behavior data for training to cover the
variance of normal behavior, which is sometimes very difficult to achieve. Second, in
order to find out whether there are intrusions in a segment of test data, the test data
must also be rich enough for statistics-based analysis; otherwise, there will not be
enough statistical information in the test data segment to justify the results.

On the other hand, if we make the test data segment long enough for statistical
analysis, we may only be able to determine whether there is abnormal behavior in this
data segment, but not where it is or what kind of intrusion it is. This
part of the workload is left to a rule-based intrusion detection system or to human experts.


5.5  Summary 

In this chapter, we have discussed the applications of clustering methods in hierarchical
simulation of complex systems and in system modelling for computer security. We propose
to use clustering techniques between high- and low-resolution modules in the analysis and
simulation of multi-resolution models in order to preserve statistical fidelity. We demonstrated that
a Hidden Markov Model can be an effective clustering tool for this task. We then turned to
the possible applications of clustering techniques in the domain of computer security, based
on the observed hierarchical structure of computer security systems. Three kinds of
behavioral models were discussed, with clustering as a possible
component between different levels of audit events. Finally, we attempted to use Hidden
Markov Models for system modelling for anomaly detection, and discussed several
issues related to this problem and possible solutions.

Chapter 6


OPTIMIZATION  EXAMPLES 


In  this  chapter  we  will  use  the  results  from  the  concurrent  simulation  chapter  for  control  and 
optimization.  We  will  first  use  the  IPA  estimates  for  performing  management  functions  in 
a  single  node  in  a  communication  network.  Subsequently,  we  use  concurrent  estimation  [21] 
together  with  the  “surrogate”  methodology  [35,  32]  to  perform  multi-commodity  resource 
allocation  in  the  context  of  mission  planning  in  a  Joint  Air  Operation  (JAO)  environment. 


6.1 Optimal Buffer Control Using SFM-Based IPA Estimators


We consider here an optimization problem for single-node SFMs involving loss volume
and workload levels; both are network-related performance metrics associated with buffer
control or call-admission control. In a typical buffer control problem, for instance, the
optimization problem involves the determination of a threshold (measured in packets or
bytes) that minimizes a weighted sum of loss volume and buffer content. One possible
problem formulation is to determine a threshold $C$ that minimizes a cost function of the
form

$$J_T(C) = Q_T(C) + R \cdot L_T(C)$$

trading off the expected queue length against the expected loss rate, weighted by a rejection
penalty $R$. If a SFM is used instead, then the cost function of interest becomes

$$J_T(\theta) = \frac{1}{T} E[Q_T(\theta)] + \frac{R}{T} E[L_T(\theta)].$$

In the case of the simple buffer control problem, we are interested in estimating $dJ_T/d\theta$
based on directly observed (simulated) data. We can then seek to obtain $\theta^*$ such that it
minimizes $J_T(\theta)$ through an iterative scheme of the form

$$\theta_{n+1} = \theta_n - \nu_n H_n(\theta_n, \omega_n^{SFM}), \quad n = 0, 1, \ldots \qquad (6.1)$$
where $\{\nu_n\}$ is a step size sequence and $H_n(\theta_n, \omega_n^{SFM})$ is an estimate of $dJ_T/d\theta$ evaluated
at $\theta = \theta_n$ and based on information obtained from a sample path of the SFM denoted by
$\omega_n^{SFM}$. However, as we saw in Chapter 3, the simple form of $H_n(\theta_n, \omega_n^{SFM})$ also enables us
to apply the same scheme to the original discrete event system:

$$C_{n+1} = C_n - \nu_n H_n(C_n, \omega_n^{DES}), \quad n = 0, 1, \ldots \qquad (6.2)$$

where $C_n$ is the threshold used for the $n$th iteration and $\omega_n^{DES}$ is a sample path of the
discrete event system.
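
The structure of the iterations (6.1) and (6.2) can be illustrated with a toy example. In
the sketch below the gradient estimator is not the IPA estimator of this chapter: it is
a stand-in that returns the derivative of a quadratic cost plus Gaussian noise, which is
an assumption made purely so that the example is self-contained. The names, the cost
function and the numerical values are likewise illustrative.

```python
import random
from typing import Callable


def stochastic_approximation(grad_estimate: Callable[[float], float],
                             theta0: float,
                             step: Callable[[int], float],
                             n_iter: int) -> float:
    """Iterate theta_{n+1} = theta_n - step(n) * H_n(theta_n)."""
    theta = theta0
    for n in range(n_iter):
        theta = theta - step(n) * grad_estimate(theta)
    return theta


rng = random.Random(0)  # fixed seed so the run is repeatable


def noisy_gradient(theta: float) -> float:
    """Stand-in estimator: derivative of (theta - 7)^2 plus unit-variance noise."""
    return 2.0 * (theta - 7.0) + rng.gauss(0.0, 1.0)


# Constant step size, as in the experiments reported here (value chosen for the toy cost).
theta_hat = stochastic_approximation(noisy_gradient, theta0=20.0,
                                     step=lambda n: 0.05, n_iter=500)
```

With a constant step size the iterate does not converge exactly but hovers in a small
neighborhood of the minimizer, which is adequate when the cost is flat near the optimum.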

The gradient estimator $H_n(\theta, \omega_n^{SFM})$ is the IPA estimator of $dJ/d\theta$ based on (3.9) and
(3.10):

=  i  (6.3) 

ke<s>(e) 

evaluated over a simulated sample path $\omega_n^{SFM}$ of length $T$, following which a control update
is performed through (6.1) based on the value of $H_n(\theta, \omega_n^{SFM})$. The interesting observation
here is that the same estimator may be used in (6.2) as follows: If a packet arrives and is
rejected, the time this occurs is recorded as $\tau$ in Algorithm 1. At the end of the current
busy period, the counter $C$ and timer $T$ are updated. Thus, the exact same expression as
in the right-hand side of (6.3) can be used to update the threshold $C_{n+1}$ in (6.2).

Figure 6.1 depicts examples of the application of this scheme to a single-node SFM
under six different parameter settings (scenarios), summarized in Table 6.1. ‘DES’ denotes
curves obtained by estimating $J_T(C)$ over different (discrete) values of $C$, ‘SFM’ denotes
curves obtained by estimating $J(\theta)$ over different values of $\theta$, and ‘Opt. Algo.’ represents the
optimization process (6.2), where we maintain real-valued thresholds throughout. The first
three scenarios correspond to a high traffic intensity $\rho$ compared to the remaining three.
For each example, $C^*$ is the optimal threshold obtained through exhaustive simulation.
In all simulations, an ON-OFF traffic source is used with the number of arrivals in each
ON period geometrically distributed with parameter $p$ and arrival rate $\alpha$; the OFF period
is exponentially distributed with parameter $\mu$; and the service rate is fixed at $\beta$. Thus,
the traffic intensity of the system is
$\rho = \alpha \frac{1}{\alpha p} \big/ \left[ \beta \left( \frac{1}{\alpha p} + \frac{1}{\mu} \right) \right]$,
where $\frac{1}{\alpha p}$ is the average length
of an ON period and $\frac{1}{\mu}$ is the average length of an OFF period. The rejection cost is
$R = 50$. For simplicity, $\nu_n$ in (6.2) is taken to be a constant $\nu_n = 5$. Finally, in all cases
$T = 100,000$. As seen in Fig. 6.1, the threshold value obtained through (6.2) using the
SFM-based gradient estimator in (6.3) either recovers $C^*$ or is close to it with a cost value
extremely close to $J_T(C^*)$; since in some cases the cost function is nearly constant in the
neighborhood of the optimum, it is difficult to determine the actual optimal threshold,
but it is also practically unimportant since the cost is essentially the same. We have
also implemented (6.2) with $H_n(C_n, \omega_n^{DES})$ estimated over shorter interval lengths $T =
10,000$ and $T = 5,000$, with virtually identical results. Looking at Fig. 6.1, it is worth
observing that determining $\theta^*$ as an approximation to $C^*$ through off-line analysis of the
SFM would also yield good approximations, further supporting the premise of this chapter
that SFMs provide an attractive modeling framework for control and optimization (not just
performance analysis) of complex networks.

Scenario    $\rho$    $\alpha$    $p$     $\mu$    $\beta$    $C^*$
1           0.99      1           0.1     0.1      0.505      7
2           0.99      1           0.05    0.05     0.505      7
3           0.99      2           0.05    0.1      1.01       15
4           0.71      1           0.1     0.1      0.7        13
5           0.71      1           0.05    0.05     0.7        11
6           0.71      2           0.05    0.1      1.4        22

Table 6.1: Parameter settings for six examples
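
As a consistency check on Table 6.1, the traffic intensity can be recomputed from the
other parameters. The short sketch below assumes that the table columns are, in order,
the scenario number, the traffic intensity, the arrival rate during an ON period, the
geometric parameter, the OFF period parameter, the service rate and the optimal threshold,
and that the mean ON and OFF period lengths are 1/(alpha p) and 1/mu as stated in the
text. The function name is ours.

```python
def traffic_intensity(alpha: float, p: float, mu: float, beta: float) -> float:
    """Average arrival rate of the ON-OFF source divided by the service rate."""
    mean_on = 1.0 / (alpha * p)   # average length of an ON period
    mean_off = 1.0 / mu           # average length of an OFF period
    mean_arrival_rate = alpha * mean_on / (mean_on + mean_off)
    return mean_arrival_rate / beta


# (alpha, p, mu, beta) for scenarios 1-6, copied from Table 6.1.
scenarios = {
    1: (1, 0.1, 0.1, 0.505),
    2: (1, 0.05, 0.05, 0.505),
    3: (2, 0.05, 0.1, 1.01),
    4: (1, 0.1, 0.1, 0.7),
    5: (1, 0.05, 0.05, 0.7),
    6: (2, 0.05, 0.1, 1.4),
}
rho = {k: traffic_intensity(*v) for k, v in scenarios.items()}
```

Under these assumptions the computed values agree with the tabulated traffic intensities
to two decimal places (about 0.99 for scenarios 1-3 and about 0.71 for scenarios 4-6).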


Figure  6.1:  Optimal  threshold  determination  in  an  actual  system  using  SFM-based  gradient 
estimators  -  Scenarios  1-6 
6.2 Multi-commodity Resource Allocation


In this section we investigate the problem of mission planning in the context of the Joint
Air Operation (JAO) environment. For this problem we assume that there are $N_i$ type $i$
aircraft that can be used in any given mission. For simplicity we assume that there are
only two types of aircraft, $i = 1$ (strike aircraft) or $i = 2$ (wild weasel). The objective
is to dynamically allocate these aircraft to various missions against a set of predefined
targets. We assume that each target is destroyed probabilistically and the probability is
monotonically increasing with the number of strike aircraft allocated to it. Furthermore,
when destroyed, each target carries a value that indicates its significance. In addition, each
target $j$ is generally defended by a set of $n_j$ SAMs that constitute a risk for the mission
aircraft. Every destroyed aircraft incurs a cost $c_i$, while destroyed SAM sites constitute no
quantifiable benefit. Finally, again for simplicity, we assume that the path that the mission
will follow is the straight line between the aircraft base and the target. The problem is
to dynamically schedule various missions until either all targets are destroyed or there are
not enough aircraft to take up a new mission. The objective is to maximize the expected
value obtained from all destroyed targets minus the cost of lost aircraft. This problem
formulation leads to a non-convex, combinatorially hard problem, which we solve using a
“surrogate” methodology that is briefly described next (for more details the reader is
referred to Appendix D or [35, 32]).

6.2.1  Basic  Approach  for  the  “Surrogate”  Method 

We consider an optimization problem of the general form

$$\min_{r \in A_d} J_d(r) = E[L_d(r, \omega)] \qquad (6.4)$$

where $r \in \mathbb{Z}_+^N$ is a decision vector or “state” and $A_d$ represents a constraint set. In a
stochastic setting, let $L_d(r, \omega)$ be the cost incurred over a specific sample path $\omega$ when
the state is $r$ and $J_d(r) = E[L_d(r, \omega)]$ be the expected cost of the system operating under
$r$. The sample space is $\Omega = [0, 1]^\infty$, that is, $\omega \in \Omega$ is a sequence of random numbers
from $[0, 1]$ used to generate a sample path of the system. The cost functions are defined
as $L_d : A_d \times \Omega \to \mathbb{R}$ and $J_d : A_d \to \mathbb{R}$, and the expectation is defined with respect to a
probability space $(\Omega, \mathcal{F}, P)$ where $\mathcal{F}$ is an appropriately defined $\sigma$-field on $\Omega$ and $P$ is a
conveniently chosen probability measure. In the sequel, ‘$\omega$’ is dropped from $L_d(r, \omega)$ and,
unless otherwise noted, all costs will be over the same sample path.

The expected cost function $J_d(r)$ is generally nonlinear in $r$, a vector of integer-valued
decision variables; therefore (6.4) is a nonlinear integer programming problem. One
common method for solving this problem is to relax the integer constraints on all $r_i$ so
that they can be regarded as continuous (real-valued) variables and then to apply standard
optimization techniques such as gradient-based algorithms. Let the “relaxed” set $A_c$ contain
the original constraint set $A_d$ and define $L_c : \mathbb{R}_+^N \times \Omega \to \mathbb{R}$ to be the cost function over
a specific sample path. The resulting “surrogate” problem then becomes: Find $\rho^*$ that
minimizes the “surrogate” expected cost function $J_c : \mathbb{R}_+^N \to \mathbb{R}$ over the continuous set $A_c$,
i.e.,

$$J_c(\rho^*) = \min_{\rho \in A_c} J_c(\rho) = E[L_c(\rho)] \qquad (6.5)$$

where $\rho \in \mathbb{R}_+^N$ is a real-valued state, and the expectation is defined on the same probability
space $(\Omega, \mathcal{F}, P)$ as described earlier. Assuming an optimal solution $\rho^*$ can be determined,
this state must then be mapped back into a discrete vector by some means (usually, some
form of truncation). Even if the final outcome of this process can recover the actual $r^*$ in
(6.4), this approach is strictly limited to off-line analysis: When an iterative scheme is used
to solve the problem in (6.5) (as is usually the case except for very simple problems of limited
interest), a sequence of points $\{\rho_n\}$ is generated; these points are generally continuous states
in $A_c$, hence they may be infeasible in the original discrete optimization problem. Moreover,
if one has to estimate $E[L_c(\rho)]$ or $\partial E[L_c(\rho)]/\partial \rho$ through simulation, then a simulation model of
the surrogate problem must be created, which is also not generally feasible. If, on the other
hand, the only cost information available is through direct observation of sample paths of
an actual system, then there is no obvious way to estimate $E[L_c(\rho)]$ or $\partial E[L_c(\rho)]/\partial \rho$, since this
applies to the real-valued state $\rho$, not to the integer-valued actual state $r$.

Here we adopt a different approach intended to operate on line. In particular, we
still invoke a relaxation such as the one above, i.e., we formulate a surrogate continuous
optimization problem with some state space $A_c \subset \mathbb{R}_+^N$ and $A_d \subset A_c$. However, at every
step $n$ of the iteration scheme involved in solving the problem, both the continuous and
the discrete states are simultaneously updated through a mapping of the form $r_n = f_n(\rho_n)$.
This has two advantages: First, the cost of the original system is continuously adjusted
(in contrast to an adjustment that would only be possible at the end of the surrogate
minimization process); and second, it allows us to make use of information typically needed
to obtain cost sensitivities from the actual operating system at every step of the process.

Initially, we set the “surrogate system” state to be that of the actual system state, i.e.,

$$\rho_0 = r_0 \qquad (6.6)$$

Subsequently, at the $n$th step of the process, let $H_n(\rho_n, r_n, \omega_n)$ denote an estimate of the
sensitivity of the cost $J_c(\rho_n)$ with respect to $\rho_n$ obtained over a sample path $\omega_n$ of the actual
system operating under allocation $r_n$. Two sequential operations are then performed at the
$n$th step:

1. The continuous state $\rho_n$ is updated through

$$\rho_{n+1} = \pi_{n+1}[\rho_n - \eta_n H_n(\rho_n, r_n, \omega_n)] \qquad (6.7)$$

where $\pi_{n+1} : \mathbb{R}^N \to A_c$ is a projection function so that $\rho_{n+1} \in A_c$ and $\eta_n$ is a “step
size” parameter.

2. The newly determined state of the surrogate system, $\rho_{n+1}$, is transformed into an
actual feasible discrete state of the original system through

$$r_{n+1} = f_{n+1}(\rho_{n+1}) \qquad (6.8)$$

where $f_{n+1} : A_c \to A_d$ is a mapping of feasible continuous states to feasible discrete
states which must be appropriately selected, as will be discussed later.

One can recognize in (6.7) the form of a stochastic approximation algorithm (e.g., [50])
that generates a sequence $\{\rho_n\}$ aimed at solving (6.5). However, there is an additional
operation (6.8) for generating a sequence $\{r_n\}$ which we would like to see converge to $r^*$
in (6.4). It is important to note that $\{r_n\}$ corresponds to feasible realizable states based
on which one can evaluate estimates $H_n(\rho_n, r_n, \omega_n)$ from observable data, i.e., a sample
path of the actual system under $r_n$ (not the surrogate state $\rho_n$). We can therefore see that
this scheme is intended to combine the advantages of a stochastic approximation type of
algorithm with the ability to obtain sensitivity estimates with respect to discrete decision
variables. In particular, the sensitivity estimation methods studied in Chapter 3 are ideally
suited to meet this objective.
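
The two operations (6.7) and (6.8) can be sketched on a toy problem. The following is a minimal illustration and not the implementation used in this report. Everything specific in it is an assumption made for the example: the separable cost $\sum_i w_i / (r_i + 1)$, the forward difference that stands in for the sensitivity estimate $H_n$, the sort-based Euclidean projection that stands in for $\pi_{n+1}$, the largest-remainder rounding that stands in for $f_{n+1}$, and the constant step size.

```python
import math

def project_to_budget(v, total):
    """Euclidean projection of v onto {x >= 0, sum(x) = total} (sort-based)."""
    u = sorted(v, reverse=True)
    running, tau = 0.0, 0.0
    for k, uk in enumerate(u, start=1):
        running += uk
        t = (running - total) / k
        if uk - t > 0:
            tau = t  # the largest k passing this test fixes the shift
    return [max(x - tau, 0.0) for x in v]

def round_to_feasible(rho, total):
    """Hypothetical f_n: floor every entry, then give the leftover units
    to the entries with the largest fractional parts (sum is preserved)."""
    r = [math.floor(x) for x in rho]
    leftover = total - sum(r)
    by_fraction = sorted(range(len(rho)), key=lambda i: rho[i] - r[i], reverse=True)
    for i in by_fraction[:leftover]:
        r[i] += 1
    return r

def surrogate_iterations(weights, r0, n_iter=20, step=1.0):
    """Run the paired updates and return the best feasible allocation seen."""
    total = sum(r0)
    cost = lambda r: sum(w / (ri + 1) for w, ri in zip(weights, r))
    rho, r = [float(x) for x in r0], list(r0)          # (6.6): rho_0 = r_0
    best_r, best_cost = list(r), cost(r)
    for _ in range(n_iter):
        # Stand-in for H_n: evaluated at the feasible discrete state r_n,
        # never at the (possibly infeasible) continuous state rho_n.
        h = [w / (ri + 2) - w / (ri + 1) for w, ri in zip(weights, r)]
        rho = project_to_budget([p - step * g for p, g in zip(rho, h)], total)  # (6.7)
        r = round_to_feasible(rho, total)                                       # (6.8)
        if cost(r) < best_cost:
            best_r, best_cost = list(r), cost(r)
    return best_r, best_cost

best_r, best_cost = surrogate_iterations([9.0, 4.0, 1.0], [2, 2, 2])
```

Every iterate `r` produced by the loop is a feasible integer allocation, which is the property that lets sensitivities be observed on the actual system. A constant step is used only to keep the sketch short; stochastic approximation schemes typically use a decreasing step sequence.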


6.2.2  Optimization  Algorithm 

In this section we simply summarize the algorithm used to solve the basic problem in (6.4)
(for more details, refer to [35, 32]).

Algorithm 2

Step 0. Initialize $\rho_0 = r_0$ and perturb $\rho_0$ to have all components non-integer.

For any iteration $n = 0, 1, \ldots$

Step 1. Determine $\mathcal{S}(\rho_n)$ [using the construction of Theorem 3.1 [32]; recall that this set
is generally not unique].

Step 2. Select $f_n \in \mathcal{F}_{\rho_n}$ such that $r_n = \arg\min_{r \in \mathcal{N}(\rho_n)} \|r - \rho_n\| = f_n(\rho_n) \in \mathcal{N}(\rho_n)$.

Step 3. Operate at $r_n$ to collect $L_d(r^i)$ for all $r^i \in \mathcal{S}(\rho_n)$ [using Concurrent Estimation or
some form of Perturbation Analysis; or, if feasible, through off-line simulation].

Step 4. Evaluate $\nabla L_c(\rho_n)$.

Step 5. Update the continuous state: $\rho_{n+1} = \pi_{n+1}[\rho_n - \eta_n \nabla L_c(\rho_n)]$.

Step 6. If some stopping condition is not satisfied, repeat Steps 1-6 for $n + 1$. Else, set
$\rho^* = \rho_{n+1}$.

Step 7. Obtain the optimal (or the near optimal) state as one of the neighboring feasible
states in the set $\mathcal{N}(\rho^*)$.

Note that for separable cost functions, Steps 1-6 can be replaced by

Step 1. Select $f_n$ such that $r_n = \arg\min_{r \in \mathcal{N}(\rho_n)} \|r - \rho_n\| = f_n(\rho_n) \in \mathcal{N}(\rho_n)$.

Step 2. Operate at $r_n$ to evaluate $\nabla L_c(\rho_n)$ using Perturbation Analysis.

Step 3. Update the continuous state: $\rho_{n+1} = \pi_{n+1}[\rho_n - \eta_n \nabla L_c(\rho_n)]$.

Step 4. If some stopping condition is not satisfied, repeat Steps 1-4 for $n + 1$. Else, set
$\rho^* = \rho_{n+1}$.

Note that ideally we would like to have $\nabla J_c(\rho_n)$ be the cost sensitivity driving the
algorithm. Since this information is not always available in a stochastic environment and
since $J_c(\rho_n) = E[L_c(\rho_n, \omega)]$, the stochastic approximation algorithm uses $\nabla L_c(\rho_n, \omega)$ as an
estimate; under some standard assumptions on the estimation error $\varepsilon_n$, where

$$\nabla J_c(\rho_n) = \nabla L_c(\rho_n, \omega) + \varepsilon_n,$$

convergence is guaranteed. In order to get $\nabla L_c(\rho_n, \omega)$, however, one needs to consider
all possible selection sets. In this algorithm we utilize only one of those selection sets
and approximate $\nabla L_c(\rho_n, \omega)$ with $\nabla L_c(\rho_n, \mathcal{S}(\rho_n), \omega)$. This approximation introduces yet
another error term $\bar{\varepsilon}_n$, where

$$\nabla L_c(\rho_n, \omega) = \nabla L_c(\rho_n, \mathcal{S}(\rho_n), \omega) + \bar{\varepsilon}_n.$$

Note that this error term $\bar{\varepsilon}_n$ exists regardless of stochasticity, unless the cost function $L_d(\cdot)$
is separable (all selection sets will yield the same sensitivity for separable cost functions).
We can combine the two error terms to define $\epsilon_n = \varepsilon_n + \bar{\varepsilon}_n$ and write

$$\nabla J_c(\rho_n) = \nabla L_c(\rho_n, \mathcal{S}(\rho_n), \omega) + \epsilon_n.$$

If the augmented error term $\epsilon_n$ satisfies the standard assumptions, then convergence of the
algorithm to the optimum follows.


6.2.3  Multicommodity  Resource  Allocation  Problems 

An interesting class of discrete optimization problems arises when $Q$ different types of
resources must be allocated to $N$ users. The corresponding optimization problem we would
like to solve is

$$\min_{r \in A_d} J(r)$$

where $r = [r_{1,1}, \ldots, r_{1,Q}, \ldots, r_{N,1}, \ldots, r_{N,Q}]$ is the allocation vector and $r_{i,q}$ is the number
of resources of type $q$ allocated to user $i$. A typical feasible set $A_d$ is defined by the capacity
constraints

$$\sum_{i=1}^{N} r_{i,q} \le K_q, \qquad q = 1, \ldots, Q$$

and possibly additional inequality constraints for each user $i = 1, \ldots, N$. Aside from
the fact that such problems are of higher dimensionality because of the $Q$ different resource
types that must be allocated to each user, it is also common that they exhibit multiple local
minima. Examples of such problems are encountered in mission planning that involves $N$
missions to be simultaneously performed, each mission $i$ requiring a “package” of resources
$(r_{i,1}, \ldots, r_{i,Q})$ in order to be carried out. The natural trade-off involved is between carrying
out fewer tasks each with a high probability of success (because each task is provided
adequate resources) and carrying out more tasks each with lower probability of success.


The “surrogate problem” method provides an attractive means of dealing with these
problems with local minima because of its convergence speed. Our approach for solving these
problems is to randomize over the initial states $r_0$ (equivalently, $\rho_0$) and seek a (possibly
local) minimum corresponding to this initial point. The process is repeated for different,
randomly selected, initial states so as to seek better solutions. For deterministic problems,
the best allocation seen so far is reported as the optimum. For stochastic problems, we adopt
the stochastic comparison approach in [36]: the algorithm is run from a randomly selected
initial point and the cost of the corresponding final point is compared with the cost of the
“best point seen so far”. The stochastic comparison test in [36] is applied to determine
the “best point seen so far” for the next run. Therefore, the surrogate problem method
can be seen as a complementary component for random search algorithms: it exploits the
problem structure to yield better generating probabilities (as discussed in [36]), which
eliminates (or reduces) visits to poor allocations and thus enables such algorithms to be applied on line.
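
For the deterministic case, the restart strategy described above can be sketched as follows. This is only an illustration under stated assumptions: `local_descent` is a hypothetical one-dimensional neighbor search that stands in for a full run of the surrogate method, and the tabulated cost with two local minima is invented for the example. The stochastic comparison test of [36] is not reproduced here.

```python
import random

def local_descent(cost, x, lo, hi):
    """Stand-in for one optimization run: move to the better neighbor
    until no neighbor improves the cost (a possibly local minimum)."""
    while True:
        neighbors = [n for n in (x - 1, x + 1) if lo <= n <= hi]
        candidate = min(neighbors, key=cost)
        if cost(candidate) >= cost(x):
            return x
        x = candidate

def multistart(cost, lo, hi, runs, seed=0):
    """Repeat the run from random initial points; keep the best point seen so far."""
    rng = random.Random(seed)
    best_x, best_cost = None, float("inf")
    for _ in range(runs):
        x = local_descent(cost, rng.randint(lo, hi), lo, hi)
        if cost(x) < best_cost:
            best_x, best_cost = x, cost(x)
    return best_x, best_cost

# Invented cost table with local minima at x = 1 and x = 5.
table = [5, 3, 4, 6, 2, 1, 3, 7]
best_x, best_cost = multistart(lambda x: table[x], 0, len(table) - 1, runs=20)
```

The fast convergence of each individual run is what makes many restarts affordable; in a stochastic setting the plain comparison `cost(x) < best_cost` would be replaced by a statistical test.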

In what follows we consider a problem with $N = 16$, $Q = 2$, and $K_1 = 20$, $K_2 = 8$. We
then seek a 32-dimensional vector $r = [r_{1,1}, r_{1,2}, \ldots, r_{16,1}, r_{16,2}]$ to maximize a reward
function of the form

$$J(r) = \sum_{i=1}^{16} J_i(r) \qquad (6.9)$$

subject to

$$\sum_{i=1}^{N} r_{i,1} \le 20, \qquad \sum_{i=1}^{N} r_{i,2} \le 8.$$

The reward functions $J_i(r)$ we will use in this problem are defined as

$$J_i(r) = V_i P_i^0(r) - C_1 r_{i,1} P_i^1(r) - C_2 r_{i,2} P_i^2(r) \qquad (6.10)$$

In (6.10), $V_i$ represents the “value” of successfully completing the $i$th task and $P_i^0(r)$ is the
probability of successful completion of the $i$th task under an allocation $r$. In addition, $C_q$ is
the cost of a resource of type $q$, where $q = 1, 2$, and $P_i^q(r)$ is the probability that a resource
of type $q$ is completely consumed or lost during the execution of the $i$th task under an
allocation $r$. A representative example of a reward function for a single task with $V_i = 150$
is shown in Fig. 6.2. The cost values of resource types are $C_1 = 20$ and $C_2 = 40$, and the
values for tasks we will use in this problem range between 50 and 150.
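
To make the structure of the capacity constraints and of the reward in (6.10) concrete, the sketch below evaluates both for a small allocation. The probability models `p_success` and `p_loss` are purely hypothetical placeholders, since the report does not give the functional forms of the success and loss probabilities here. The sketch also assumes, for simplicity, that they depend only on the resources given to the task itself.

```python
def is_feasible(alloc, capacities):
    """alloc[i][q] is the number of type-q resources given to task i.
    Checks non-negativity and sum_i alloc[i][q] <= K_q for every type q."""
    within_capacity = all(
        sum(row[q] for row in alloc) <= cap for q, cap in enumerate(capacities)
    )
    return within_capacity and all(x >= 0 for row in alloc for x in row)

def task_reward(value, unit_costs, row, p_success, p_loss):
    """Reward of one task with the structure of (6.10):
    value * P(success) - sum_q C_q * r_q * P(type-q resource consumed or lost)."""
    expected_loss = sum(
        c * r * p_loss(q, row) for q, (c, r) in enumerate(zip(unit_costs, row))
    )
    return value * p_success(row) - expected_loss

def total_reward(values, unit_costs, alloc, p_success, p_loss):
    """Sum of the per-task rewards, as in (6.9)."""
    return sum(
        task_reward(v, unit_costs, row, p_success, p_loss)
        for v, row in zip(values, alloc)
    )

# Hypothetical probability models, for illustration only.
p_success = lambda row: min(1.0, 0.1 * sum(row))
p_loss = lambda q, row: 0.5

alloc = [[3, 1], [2, 0]]                      # two tasks, two resource types
feasible = is_feasible(alloc, [20, 8])        # capacities K_1 = 20, K_2 = 8
reward = total_reward([150, 50], [20, 40], alloc, p_success, p_loss)
```

With these placeholder models the first task contributes 150 × 0.4 − (20 × 3 × 0.5 + 40 × 1 × 0.5) = 10 and the second contributes 50 × 0.2 − 20 × 2 × 0.5 = −10, so the total reward of this allocation is 0.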

The surrogate method is executed from random initial points and the results for some
runs are shown in Fig. 6.3. Note that due to local maxima, some runs yield suboptimal
results. However, in all cases convergence is attained extremely fast, enabling us to repeat
the optimization process multiple times with different initial points in search of the global
maximum. Although it is infeasible to identify the actual global maximum, we have
compared our approach to a few heuristic techniques and pure random search methods and
found the “surrogate problem” method to outperform them. To demonstrate the effectiveness
of the approach we have developed an “applet” for the scenario described above which
can be accessed at http://vita.bu.edu/cgc/alpha/index.htm.

Figure 6.3: Algorithm convergence under different initial points.
Chapter 7


CONCLUSIONS AND FUTURE
RESEARCH DIRECTIONS


In this chapter, we summarize the main findings, lessons learned, and recommendations for
future directions that have resulted from this project.

Model abstraction through fluid simulation: Fluid models (FM) are
efficient abstract modeling paradigms that can either approximate the dynamics of complex
discrete-event systems or constitute primary models in their own right. In FM, the state
of the system is described by discrete as well as continuous variables. Furthermore,
the system dynamics are both time-driven and event-driven; FM therefore fall in the class
of hybrid simulation models, which can be used to model a fairly broad class of systems
including battle engagements, communication networks, manufacturing systems, and many
more.

For performance evaluation, when a FM is used to approximate the dynamics of a
discrete-event system, the important tradeoff is efficiency vs. accuracy. This tradeoff, which
we have investigated thoroughly, relates to the resolution (i.e., the number of events
aggregated together) of the fluid model used. On the other hand, FM can be used for the control
and optimization of various systems. In this case, a FM may identify the solution of an
optimization problem based on a model which captures only those features of the underlying
“real” system that are needed to lead to the right solution, without necessarily estimating
the corresponding optimal performance accurately. Even if the exact solution cannot be
obtained by such “lower-resolution” models, one can still obtain near-optimal points that
exhibit robustness properties with respect to certain aspects of the model they are based
on. This property of FM is very promising and deserves further investigation.

Concurrent simulation: One of the accomplishments of this project is a set of significant
breakthroughs in the area of sample derivative estimation for discrete-event systems.
In this context, we use a Stochastic Fluid Model (SFM) to approximate the dynamics of the
system. Based on this approximation, we derive the “structure” of the sample derivatives of
interest, which turn out to be unbiased and nonparametric. Finally, we evaluate the sample
derivatives from observations on the sample path of the discrete-event system (simulated or
actual system). Note that sensitivity estimates derived directly from the discrete-event
system are often biased.

The sample derivative analysis is promising and can find applications in several areas
such as communication systems and manufacturing systems. In the context of communication
networks, it is possible to devise extensions to multiple flow classes that can be used
for differentiating traffic classes with different Quality-of-Service (QoS) requirements.
Ongoing research has already led to very encouraging results, reported in [17, 16, 15], involving
IPA estimators and associated optimization for flow control purposes in multi-node models.
Furthermore, extensions to networks of SFMs are underway.

Model abstraction using neural networks: The metamodeling procedure we have
studied combines simulation of a complex system with the process of training a neural
network to become a surrogate model of this system. This exploits the ability of a neural
network to act as a universal function approximator. However, if a neural net is to
adequately learn the functional relationship between the inputs and outputs of a simulation
model, it requires a significant number of input/output pairs. Since such information can
only be obtained through simulation, the training phase of the neural network
will be rather long. For this reason, we investigate the use of sensitivity information
(extracted through some concurrent simulation technique) in the training of neural networks.
Our preliminary results indicate that the use of sensitivity information may significantly
reduce the number of required input/output training pairs.

The use of sensitivity information during training of neural networks raises several
issues that need to be addressed and are part of our future plans. First, the addition of the
derivative error in (4.9) makes the training objective function more complex. One problem
observed during our experiments is that this addition may create several
local minima and, as a result, convergence issues may arise. Furthermore, the importance
associated with the derivative errors (i.e., the parameter $\beta$) needs to be further investigated.
As mentioned earlier, this parameter may be critical to the quality of the approximation as
well as the convergence of the training algorithm.

Hierarchical simulation and statistical fidelity: We have investigated the interfacing
of high- and low-resolution models in the context of hierarchical simulation. Simple
averaging of the output data from a high-resolution simulator to generate input data for
a low-resolution simulator is inadequate and occasionally dramatically erroneous. Therefore,
to maintain statistical fidelity, high-resolution output data should be classified into
groups that match underlying patterns or features of the system behavior before passing
group averages to the low-resolution modules. In an effort to automate the interfacing
procedure, we have explored various clustering tools including neural networks and hidden
Markov models (HMMs). We demonstrated that the HMM is an effective clustering tool,
especially for problems with high-dimensional input spaces (e.g., sample path clustering).
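
The difference between simple averaging and group averaging can be seen in a small numerical sketch. The data, the quadratic low-resolution response, and the threshold that assigns group labels are all assumptions chosen for illustration; the threshold merely stands in for the clustering tools (neural networks, HMMs) explored in this project.

```python
def simple_average_output(samples, low_res):
    """Feed the overall mean of the high-resolution output to the low-resolution model."""
    return low_res(sum(samples) / len(samples))

def grouped_average_output(samples, labels, low_res):
    """Feed each group's mean to the low-resolution model and weight the
    results by the fraction of samples that fall in each group."""
    groups = {}
    for x, label in zip(samples, labels):
        groups.setdefault(label, []).append(x)
    n = len(samples)
    return sum(len(xs) / n * low_res(sum(xs) / len(xs)) for xs in groups.values())

# Invented high-resolution output with two distinct regimes.
samples = [1.0, 1.0, 1.0, 1.0, 9.0, 9.0]
labels = ["low" if x < 5.0 else "high" for x in samples]   # stand-in for clustering
low_res = lambda x: x * x                                  # hypothetical nonlinear response

reference = sum(low_res(x) for x in samples) / len(samples)  # sample-by-sample average
simple = simple_average_output(samples, low_res)
grouped = grouped_average_output(samples, labels, low_res)
```

Here the reference value is 166/6 ≈ 27.67. Simple averaging gives (22/6)² ≈ 13.44, less than half of it, while group averaging recovers the reference exactly because each group is homogeneous.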

Sample path clustering can also find applications in areas other than statistical fidelity
preservation. For example, we have investigated the HMM in the domain of computer security
based on the observation that computer security systems have a hierarchical structure.
We discussed three aspects of behavioral models, with clustering as a possible key
component linking different levels of audit events. We then attempted
to use HMMs for the purpose of system modeling for anomaly detection. This approach
seems promising and we believe that it deserves further investigation.
Appendix A


PROOFS


A.1 Proof of (2.3)


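Assume $x_0 = 0$, $y_0 = 0$ and $a_0 = 0$. From Equation (2.2),

$$
\begin{aligned}
y_{n+1} &= \max(y_n + a_n - h,\; a_n) \\
&= \max(y_{n-1} + a_n + a_{n-1} - 2h,\; a_n + a_{n-1} - h,\; a_n) \\
&= \max\Big( y_0 + \sum_{i=0}^{n} a_{n-i} - (n+1)h,\; \max_{0 \le j \le n} \Big( \sum_{i=0}^{j} a_{n-i} - jh \Big) \Big) \\
&= \max_{0 \le j \le n} \Big( \sum_{i=0}^{j} a_{n-i} - jh \Big) \qquad \text{(by } y_0 = 0\text{)}. \qquad (A.1)
\end{aligned}
$$

We do not give the induction proof due to the space limitation. Handling Equation (2.1)
in the same way, we have

$$x_{n+1} = \max\Big( \max_{0 \le j \le n} \Big( \sum_{i=0}^{j} a_{n-i} - jh - h \Big),\; 0 \Big). \qquad (A.2)$$

Combining Equations (A.1) and (A.2), we get

$$x_{n+1} = \max(y_{n+1} - h,\; 0). \qquad (A.3)$$

Given Eqs. (2.1) and (A.3), we can derive

$$\max(y_{n+1} - h,\; 0) = \max(x_n + a_n - h,\; 0), \qquad \forall n \ge 0.$$

So

$$x_n = y_{n+1} - a_n,$$

and

$$E x = E y - E a.$$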
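
The relation between the two sequences can be checked numerically. The sketch below assumes that (2.1) has the form $x_{n+1} = \max(x_n + a_n - h, 0)$ and that (2.2) has the form $y_{n+1} = \max(y_n + a_n - h, a_n)$, which is what the expansions in this proof suggest; Chapter 2 should be consulted for the exact statements. The uniform arrival amounts, the value of $h$, and the function name are arbitrary choices for the illustration.

```python
import random

def max_identity_gap(n_steps=1000, h=1.0, seed=0):
    """Iterate both recursions from x_0 = y_0 = a_0 = 0 and return the
    largest observed value of |x_n - (y_{n+1} - a_n)|."""
    rng = random.Random(seed)
    x, y, a = 0.0, 0.0, 0.0
    worst = 0.0
    for _ in range(n_steps):
        x_next = max(x + a - h, 0.0)   # assumed form of (2.1)
        y_next = max(y + a - h, a)     # assumed form of (2.2)
        worst = max(worst, abs(x - (y_next - a)))
        x, y = x_next, y_next
        a = rng.uniform(0.0, 1.5)      # next arrival amount (arbitrary choice)
    return worst

gap = max_identity_gap()
```

The gap stays at the level of floating-point rounding, consistent with $x_n = y_{n+1} - a_n$ holding along every sample path and hence with $E x = E y - E a$.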
A.2 Proof of (2.12)

[Equation (A.4) is not recoverable from the source; only the index ranges $k = 1, \ldots$; $j = 1, \ldots$ with $j \ne k$; and $i = 1, \ldots$ with $i \ne k$ survive.] (A.4)

where $B$ and $A$ denote the second and third terms on the right-hand side of Equation (A.4), respectively.
Perturbation Analysis for On-Line
Control and Optimization of
Stochastic Fluid Models

Abstract

This  paper  uses  Stochastic  Fluid  Models  (SFM)  for  control  and  optimization  (rather  than 
performance  analysis)  of  communication  networks,  focusing  on  problems  of  buffer  control. 
We  derive  gradient  estimators  for  packet  loss  and  workload  related  performance  metrics 
with  respect  to  threshold  parameters.  These  estimators  are  shown  to  be  unbiased  and 
directly  observable  from  a  sample  path  without  any  knowledge  of  underlying  stochastic 
characteristics, including traffic and processing rates (i.e., they are nonparametric). This
renders  them  computable  in  on-line  environments  and  easily  implementable  for  network 
management  and  control.  We  further  demonstrate  their  use  in  buffer  control  problems 
where  our  SFM-based  estimators  are  evaluated  based  on  data  from  an  actual  system. 
B.1 Introduction


A  natural  modeling  framework  for  packet-based  communication  networks  is  provided  through 
queueing  systems.  However,  the  huge  traffic  volume  that  networks  are  supporting  today 
makes  such  models  highly  impractical.  It  may  be  impossible,  for  example,  to  simulate  at 
the  packet  level  a  network  slated  to  transport  packets  at  gigabit-per-second  rates.  If,  on  the 
other  hand,  we  are  to  resort  to  analytical  techniques  from  classical  queueing  theory,  we  find 
that  traditional  traffic  models,  largely  based  on  Poisson  processes,  need  to  be  replaced  by 
more  sophisticated  stochastic  processes  that  capture  the  bursty  nature  of  realistic  traffic; 
in  addition,  we  need  to  explicitly  model  buffer  overflow  phenomena  which  typically  defy 
tractable  analytical  derivations. 

An  alternative  modeling  paradigm,  based  on  Stochastic  Fluid  Models  (SFM),  has  been 
recently  considered  for  the  purpose  of  analysis  and  simulation.  Introduced  in  [4]  and  later 
proposed in [48] for the analysis of multiplexed data streams and network performance [25],
SFMs  have  been  shown  to  be  especially  useful  for  simulating  various  kinds  of  high-speed 
networks  [77],  [47],  [49],  [62],  [56],  [86],  [79].  The  fluid-flow  worldview  can  provide  either 
approximations  to  queueing-based  models  or  primary  models  in  their  own  right.  In  any 
event,  its  justification  rests  on  a  molecular  view  of  packets  in  moderate-to-heavy  loads  over 
high-speed  transmission  links,  where  the  effect  of  an  individual  packet  or  cell  on  the  entire 
traffic  process  is  virtually  infinitesimal,  not  unlike  the  effect  of  a  water  molecule  on  the 
water  flow  in  a  river. 

The  efficacy  of  a  SFM  rests  on  its  ability  to  aggregate  multiple  events.  For  example,  a 
discrete  event  simulation  run  of  an  ATM  link  operating  at  622  Megabits-per-second  may 
have  to  process  over  a  million  events  per  second.  On  the  other  hand,  if  traffic  arrives  from 
the  source  at  rates  that  are  piecewise-constant  functions  of  time,  then  a  simulation  run 
would  process  only  one  event  per  rate  change.  Thus,  30  rate  changes  per  second  (as  in 
certain  video  encoders)  may  require  the  processing  of  only  30  events  per  second.  In  effect, 
the  SFM  paradigm  allows  the  aggregation  of  multiple  events,  associated  with  the  movement 
of  individual  packets/cells  over  a  time  period  of  a  constant  flow  rate,  into  a  single  event 
associated with a rate change. It forgoes the identity and dynamics of individual packets
and focuses instead on the aggregate flow rate.
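
The order of magnitude quoted above can be reproduced with a short calculation. It assumes a fully loaded link, the standard 53-byte ATM cell, and one simulation event per cell; the function name is hypothetical and introduced only for this illustration.

```python
ATM_CELL_BITS = 53 * 8  # a standard ATM cell is 53 bytes long

def cell_events_per_second(link_rate_bps, cell_bits=ATM_CELL_BITS):
    """Events per second if every cell on a fully loaded link triggers one event."""
    return link_rate_bps / cell_bits

cell_events = cell_events_per_second(622e6)  # roughly 1.47 million per second
fluid_events = 30                            # one event per rate change
ratio = cell_events / fluid_events           # events aggregated into each fluid event

print(f"{cell_events:,.0f} cell events/s vs {fluid_events} fluid events/s "
      f"(about {ratio:,.0f} to 1)")
```

Under these assumptions each fluid event replaces on the order of tens of thousands of cell-level events, which is the aggregation effect the paragraph describes.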

For  the  purpose  of  performance  analysis  with  Quality  of  Service  (QoS)  requirements,  the 
accuracy  of  SFMs  depends  on  traffic  conditions,  the  structure  of  the  underlying  system,  and 
the nature of the performance metrics of interest. By forgoing the identity of individual
packets, the SFM paradigm is more suitable for network-related measures, such as buffer
levels and packet loss volumes, than for packet-related measures such as sojourn times
(although it is still possible to define fluid-based sojourn times [80]). A QoS metric that
depends on the identity of certain packets, for example, cannot obviously be captured
by a fluid model. Moreover, some metrics may depend on higher-order statistics of the
distributions  of  the  underlying  random  variables  involved,  which  a  fluid  model  may  not  be 
able  to  accurately  capture. 

In this paper, our goal is to explore the use of SFMs for the purpose of control and
optimization rather than performance analysis. In this case, it is reasonable to expect
that  the  solution  of  an  optimization  problem  can  be  identified  through  a  model  which 
captures  only  those  features  of  the  underlying  “real”  system  that  are  needed  to  lead  to  the 
right  solution,  even  though  the  corresponding  optimal  performance  may  not  be  accurately 
estimated.  Even  if  the  exact  solution  cannot  be  obtained  by  such  “lower-resolution”  models, 
one  can  still  obtain  near-optimal  points  that  exhibit  robustness  with  respect  to  certain 
aspects  of  the  model  they  are  based  on.  Such  observations  have  been  made  in  several 
contexts  (e.g.,  [63]),  including  recent  results  related  to  SFMs  reported  in  [60]  where  a 
connection  between  the  SFM  and  queueing-system-based  solution  is  established  for  various 
optimization  problems  in  queueing  systems. 

With this in mind, we consider here optimization problems for single-node SFMs
involving loss volume and workload levels; both are network-related performance metrics
associated  with  buffer  control  or  call-admission  control.  In  a  typical  buffer  control  problem, 
for  instance,  the  optimization  problem  involves  the  determination  of  a  threshold  (measured 
in  packets  or  bytes)  that  minimizes  a  weighted  sum  of  loss  volume  and  buffer  content.  As 
the  motivating  example  presented  in  Section  2  illustrates,  a  solution  of  this  problem  based 
on  a  SFM  gives  a  close  approximation  to  the  solution  of  the  associated  queueing  model. 
Since  solving  such  problems  usually  relies  on  gradient  information,  estimating  the  gradient 
of  a  given  cost  function  with  respect  to  the  aforementioned  threshold  parameters  in  a  SFM 
becomes an essential task. Perturbation Analysis (PA) methods [41], [20] are therefore
suitable, if appropriately adapted to a SFM viewed as a discrete-event system. Liu and Gong
[57],  for  example,  have  used  PA  to  analyze  an  infinite-capacity  SFM,  with  incoming  traffic 
rates  as  the  parameters  of  interest.  In  this  paper  we  show  that  Infinitesimal  Perturbation 
Analysis  (IPA)  yields  remarkably  simple  sensitivity  estimators  for  packet  loss  and  workload 
metrics  with  respect  to  threshold  or  buffer  size  parameters.  These  estimators  also  turn  out 
to  be  nonparametric  in  the  sense  that  they  are  computable  from  data  directly  observable 
along  a  sample  path,  requiring  no  knowledge  of  the  underlying  probability  law,  including 
distributions of the random processes involved, or even parameters such as traffic or
processing rates. In addition, the estimators obtained are unbiased under very weak structural
assumptions  on  the  defining  traffic  processes.  Therefore,  the  IPA  gradient  estimators  that 
we  derive  can  be  readily  used  for  on-line  control  purposes  to  perform  periodic  network 
management  functions  in  order  to  guarantee  negotiated  QoS  parameters  and  to  improve 
performance.  For  instance,  a  network  can  monitor  its  relative  loss  rate  and  mean  buffer 
contents  for  a  period  of  time,  and  then  adjust  admission  parameters,  provision  transmission 
capacities,  or  reassign  threshold  levels  in  order  to  improve  performance.  Such  management 
functions have not been standardized, and typically are performed in ad-hoc ways by
monitoring performance levels. Aside from solving explicit optimization problems, IPA gradient
estimators  simplify  the  implementation  of  sensitivity  analysis. 

The  contributions  of  this  paper  are  as  follows.  First,  we  consider  a  single-node  SFM 
and  derive  IPA  gradient  estimators  for  performance  metrics  related  to  loss  and  workload 
levels  with  respect  to  threshold  parameters  (equivalently,  buffer  sizes).  One  can  derive 
such estimators by either (a) considering the finite difference of a performance metric as
a function of the finite difference of a parameter and then using explicit limit arguments to
obtain an unbiased estimate of the performance metric derivative, or (b) deriving the sample
derivative for the performance metric involved, and then proving that it indeed yields an
unbiased  estimate.  The  former  approach  provides  clear  insights  into  the  dynamic  process 
of  generation  and  propagation  of  perturbations,  which  is  very  helpful  in  understanding 
how  to  extend  the  approach  to  multiple  fluid  classes  and  multiple  nodes.  In  addition,  it 
requires  no  technical  conditions  on  the  traffic  processes  or  the  sample  functions  involved. 
However,  this  approach  is  tedious,  even  for  a  simple  single-node  model.  The  latter  approach 
is  simpler  and  more  elegant,  at  the  expense  of  some  mild  technical  conditions  needed  to 
justify  the  evaluation  of  the  sample  derivative.  It  requires,  however,  some  results  from 
the  first  approach  in  order  to  prove  unbiasedness  of  the  derived  estimators.  Thus,  in  this 
paper,  we  start  with  the  former  approach,  and  then  show  that  the  estimators  derived  are 
equivalent  to  the  latter,  which  we  subsequently  adopt.  Based  on  these  estimators,  we 
also  present  simple  algorithms  for  implementing  them  on  line,  taking  advantage  of  their 
nonparametric  nature. 

The  second  contribution  of  the  paper  is  to  make  use  of  the  IPA  gradient  estimators 
derived to tackle buffer control as an optimization problem. In particular, we seek to
determine the threshold value that minimizes a given performance metric. Packet-by-packet
buffer control can be applied after the session admission decision is made in order to
dynamically adjust network resources so as to minimize some cost based on the promised
QoS.  We  use  a  standard  gradient-based  stochastic  optimization  scheme,  where  we  estimate 
the  gradient  of  the  performance  function  with  respect  to  the  threshold  parameter  on  the 
SFM;  however,  due  to  the  simplicity  of  this  gradient  estimator,  we  evaluate  it  based  on  data 
observed on a sample path of the actual (discrete-event) system. Thus, we use the SFM
only  to  obtain  a  gradient  estimator;  the  associated  value  at  any  operating  point  is  obtained 
from  real  system  data. 

The  paper  is  organized  as  follows.  First,  in  Section  2,  we  motivate  our  approach  with 
a  buffer  control  problem  in  the  SFM  setting  and  show  the  application  of  IPA  to  it.  In 
Section  3,  we  describe  the  detailed  SFM  setting  and  define  the  performance  metrics  and 
parameters  of  interest.  In  Section  4,  we  derive  IPA  estimators  for  the  sensitivities  of  the 
expected  loss  rate  and  workload  with  respect  to  threshold  parameters  (equivalently,  buffer 
sizes)  and  show  their  unbiasedness.  This  is  first  demonstrated  by  a  direct  approach  based 
on  finite  differences.  The  IPA  approach  is  then  generalized  by  evaluating  sample  derivatives 
(at  the  expense  of  introducing  some  mild  technical  conditions);  these  are  shown  to  provide 
unbiased  performance  derivative  estimators  which  are  of  nonparametric  nature.  Algorithms 
for  implementing  the  derivative  estimators  obtained  are  also  provided.  In  Section  5,  we 
show  how  the  SFM-based  derivative  estimates  can  be  used  on  line  using  data  from  the 
actual  system  (not  the  SFM)  in  order  to  solve  buffer  control  problems.  Finally,  in  Section 
6  we  outline  a  number  of  open  problems  and  future  research  directions  motivated  by  this 
work. 


B.2  A  Motivating  Example:  Threshold-Based  Buffer  Control 

This section presents a motivating example of buffer control in the setting of both a queueing model and a SFM and then compares the two. Consider a network node where buffer control at the packet level takes place using a simple threshold-based policy: when a packet arrives and the queue length is below a given amount $C$, it is accepted; otherwise it is rejected. Let $L(C)$ denote the expected loss rate, i.e., the expected rate of packet overflow at steady state, and let $Q(C)$ denote the mean queue length when the threshold is $C$. We then define the cost function

$$J(C) = Q(C) + R \cdot L(C), \tag{B.1}$$

where $R$ is a penalty associated with rejecting a packet. Thus, $J(C)$ captures the tradeoff between providing satisfactory service (low delay) and rejecting too many packets. Since, arguably, the notion of steady state is hard to justify in many networks, and since control decisions need to be made periodically or in response to apparent adverse network conditions, a more realistic performance measure is one where $L(C)$ and $Q(C)$ are replaced by $L_T(C)$ and $Q_T(C)$, the expected loss rate and mean queue length, respectively, over the time interval $[0, T]$. We then consider

$$J_T(C) = Q_T(C) + R \cdot L_T(C) \tag{B.2}$$

to be the cost function of interest. Care must be taken in defining the above expectations over a finite time horizon, since they generally depend on the initial conditions; for the time being, we shall assume that the queue is empty at time $t = 0$, and revisit this point later. Figure B.1 depicts the queueing system under consideration.

Figure B.1: Buffer control in a single node

The packet arrival process is modeled as an ON-OFF source so that packets arrive at a peak rate $\alpha$ during an ON period, followed by an OFF period during which no packets arrive. The packet processing rate is $\beta$. For the example used here and illustrated in Fig. B.1, the number of arrivals in each ON period is geometrically distributed with parameter $p = 0.05$ and arrival rate $\alpha = 2$; the OFF period is exponentially distributed with parameter $\mu = 0.1$; and the service rate is $\beta = 1.01$. Thus, the traffic intensity of the system is

$$\frac{\alpha}{\beta} \cdot \frac{\frac{1}{\alpha p}}{\frac{1}{\alpha p} + \frac{1}{\mu}} = 0.99,$$

where $\frac{1}{\alpha p}$ is the average length of an ON period and $\frac{1}{\mu}$ is the average length of an OFF period. The cost function $J_T(C)$ in this problem is piecewise constant, hence gradient-based algorithms cannot be used. However, by exhaustively simulating this queueing system, averaging over 25 sample paths of length $T = 100{,}000$ time units, and estimating $J_T(C)$ over different discrete values of $C$, we obtained the curve labeled ‘DES’ in Fig. B.2, using a rejection penalty $R = 50$. One can see that the optimal threshold value in this example is $C^* = 15$.
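
As a quick arithmetic check of the traffic intensity quoted above, the following minimal sketch recomputes it from the stated parameters. It assumes that the second "parameter" in the text is the rate of the exponential OFF period (called `mu` here) and that an ON period with a geometric number of arrivals has mean length 1/(alpha p); the variable names are ours, not the report's.

```python
# Parameter values quoted in the text; the variable names are ours.
alpha = 2.0   # peak arrival rate during an ON period
beta = 1.01   # service rate
p = 0.05      # geometric parameter: mean number of arrivals per ON period is 1/p
mu = 0.1      # assumed rate of the exponentially distributed OFF period

mean_on = 1.0 / (alpha * p)    # 20 arrivals at rate 2 take 10 time units on average
mean_off = 1.0 / mu            # 10 time units on average

on_fraction = mean_on / (mean_on + mean_off)   # long-run fraction of time spent ON
rho = alpha * on_fraction / beta               # average inflow divided by service rate

print(f"mean ON = {mean_on}, mean OFF = {mean_off}, rho = {rho:.4f}")
```

Under these assumptions the source is ON half of the time, so the average inflow is 1 against a service rate of 1.01, which is consistent with the value 0.99 given in the text.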


Figure  B.2:  Cost  v.  threshold  comparison  for  DES  and  SFM 


Next, we adopt a simple SFM for the same system, treating packets as “fluid”. During an ON period, the fluid volume in the buffer, $x(t)$, increases at rate $\alpha - \beta$ (we assume $\alpha > \beta$, otherwise there would be no buffer accumulation), while during an OFF period it decreases at rate $\beta$. The cost function in this model is

$$J_T^{SFM}(\theta) = Q_T^{SFM}(\theta) + R \cdot L_T^{SFM}(\theta) \tag{B.3}$$

where $\theta \in \mathbb{R}^+$ is the threshold used to reject incoming fluid when the buffer fluid volume reaches level $\theta$. The corresponding expected loss rate and mean buffer fluid volume over the time interval $[0, T]$ are denoted by $L_T^{SFM}(\theta)$ and $Q_T^{SFM}(\theta)$, respectively. Simulating this model under the same ON-OFF conditions as before over many values of $\theta$ results in the curve labeled “SFM” in Fig. B.2. The important observation is that the two optima are close, whereas the difference in the actual cost estimates can be substantial (especially for a lightly loaded system). In fact, $\theta^* = 13$ and $J_T^{SFM}(13) = 17.073$, as compared to $J_T(13) = 18.127$ and the optimal $J_T(C^*) = J_T(15) = 18.012$.

Based on this observation, we are motivated to study means for efficiently identifying solutions to problems formulated in a SFM setting. It is still difficult to obtain analytical solutions, however, since expressions for $Q_T^{SFM}(\theta)$ and $L_T^{SFM}(\theta)$ are unavailable, unless the arrival and service processes in the actual system are very simple. Therefore, one needs to resort to iterative methods such as stochastic approximation algorithms (e.g., [50]), which are driven by estimates of the gradient of a cost function with respect to the parameters of interest.

In the case of the simple buffer control problem above, we are interested in estimating $dJ_T/d\theta$ based on directly observed (simulated) data. We can then seek to obtain $\theta^*$ such that it minimizes $J_T(\theta)$ through an iterative scheme of the form

$$\theta_{n+1} = \theta_n - \nu_n H_n(\theta_n, \omega_n^{SFM}), \quad n = 0, 1, \ldots \tag{B.4}$$

where $\{\nu_n\}$ is a step size sequence and $H_n(\theta_n, \omega_n^{SFM})$ is an estimate of $dJ_T/d\theta$ evaluated at $\theta = \theta_n$ and based on information obtained from a sample path of the SFM denoted by $\omega_n^{SFM}$. However, as we will see, the simple form of $H_n(\theta_n, \omega_n^{SFM})$ to be derived also enables us to apply the same scheme to the original discrete event system:

$$C_{n+1} = C_n - \nu_n H_n(C_n, \omega_n^{DES}), \quad n = 0, 1, \ldots \tag{B.5}$$

where $C_n$ is the threshold used for the $n$th iteration and $\omega_n^{DES}$ is a sample path of the discrete event system. In other words, analyzing the SFM provides us with the structure of a gradient estimator whose actual value can be obtained based on data from the actual system. In Fig. B.2, the curve labeled “Opt. Algo.” corresponds to this process and illustrates how one can indeed recover the optimal threshold $C^* = 15$.
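
The iteration in (B.4) and (B.5) can be sketched as a generic stochastic approximation loop. The sketch below is illustrative only: the function names, the step size 1/(n+1), the projection onto a bounded interval, and the noisy quadratic used as a stand-in for the gradient estimator $H_n$ are all assumptions of ours and are not taken from the report.

```python
import random


def stochastic_approximation(grad_estimate, theta0, n_iter=2000,
                             lower=0.0, upper=100.0, seed=0):
    """Iterate theta_{n+1} = theta_n - nu_n * H_n(theta_n) with nu_n = 1/(n+1).

    grad_estimate(theta, rng) plays the role of H_n: a noisy estimate of the
    cost derivative obtained from one sample path (hypothetical interface).
    """
    rng = random.Random(seed)
    theta = theta0
    for n in range(n_iter):
        step = 1.0 / (n + 1)                 # decreasing step size sequence
        h = grad_estimate(theta, rng)        # one noisy derivative estimate
        theta = theta - step * h             # gradient step
        theta = min(upper, max(lower, theta))  # keep theta in a compact interval
    return theta


def toy_gradient(theta, rng, target=10.0):
    """Stand-in for H_n: derivative of 0.5*(theta - target)^2 plus zero-mean noise."""
    return (theta - target) + rng.gauss(0.0, 1.0)


theta_final = stochastic_approximation(toy_gradient, theta0=40.0)
print(f"final iterate: {theta_final:.3f}")
```

In an actual application the toy gradient would be replaced by an estimator computed from an observed sample path, which is the role the IPA estimators derived later are meant to play.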


B.3  The  Stochastic  Fluid  Model  (SFM)  Setting 


The SFM setting is based on the fluid-flow worldview, where “liquid molecules” flow in a continuous fashion. The basic SFM, used in [80] and shown in Figure 3, consists of a single-server (spigot) preceded by a buffer (fluid storage tank), and it is characterized by five stochastic processes, all defined on a common probability space $(\Omega, \mathcal{F}, P)$ as follows:

• $\{\alpha(t)\}$: the input flow (inflow) rate to the SFM,

• $\{\beta(t)\}$: the service rate, i.e., the maximal fluid discharge rate from the server,

• $\{\delta(t)\}$: the output flow (outflow) rate from the SFM, i.e., the actual fluid discharge rate from the server,

• $\{x(t)\}$: the buffer occupancy or buffer content, i.e., the volume of fluid in the buffer,

• $\{\gamma(t)\}$: the overflow (spillover) rate due to excessive incoming fluid at a full buffer.

The above processes evolve over a time interval $[0, T]$ for a given fixed $T > 0$. The inflow process $\{\alpha(t)\}$ and the service-rate process $\{\beta(t)\}$ are assumed to be right-continuous piecewise constant, with $0 \le \alpha_{\min} \le \alpha(t) \le \alpha_{\max} < \infty$ and $0 \le \beta_{\min} \le \beta(t) \le \beta_{\max} < \infty$. Let $\theta$ denote the size of the buffer, which is the variable parameter we will concentrate on for the purpose of IPA. The processes $\{\alpha(t)\}$ and $\{\beta(t)\}$, along with the buffer size $\theta$, define the behavior of the SFM. In particular, they determine the buffer content $x(\theta; t)$, the overflow rate $\gamma(\theta; t)$, and the output flow $\delta(\theta; t)$. The notational dependence on $\theta$ indicates that we will analyze performance metrics as functions of the given $\theta$. We will assume that the real-valued parameter $\theta$ is confined to a closed and bounded (compact) interval $\Theta$; to avoid unnecessary technical complications, we assume that $\theta > 0$ for all $\theta \in \Theta$.

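The buffer content $x(\theta; t)$ is determined by the following one-sided differential equation,

$$\frac{dx(\theta; t)}{dt^+} = \begin{cases} 0, & \text{if } x(\theta; t) = 0 \text{ and } \alpha(t) - \beta(t) \le 0, \\ 0, & \text{if } x(\theta; t) = \theta \text{ and } \alpha(t) - \beta(t) \ge 0, \\ \alpha(t) - \beta(t), & \text{otherwise} \end{cases} \tag{B.6}$$

with the initial condition $x(\theta; 0) = x_0$ for some given $x_0$; for simplicity, we set $x_0 = 0$ throughout the paper. The outflow rate $\delta(\theta; t)$ is given by

$$\delta(\theta; t) = \begin{cases} \beta(t), & \text{if } x(\theta; t) > 0, \\ \alpha(t), & \text{if } x(\theta; t) = 0, \end{cases} \tag{B.7}$$

where we point out that if we allow $\theta = 0$, then $\delta(\theta; t) = \min\{\alpha(t), \beta(t)\}$. The overflow rate $\gamma(\theta; t)$ is given by

$$\gamma(\theta; t) = \begin{cases} \max\{\alpha(t) - \beta(t), 0\}, & \text{if } x(\theta; t) = \theta, \\ 0, & \text{if } x(\theta; t) < \theta. \end{cases} \tag{B.8}$$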

This SFM can be viewed as a dynamic system whose input consists of the two defining processes $\{\alpha(t)\}$ and $\{\beta(t)\}$ along with the buffer size $\theta$, whose state is comprised of the buffer content process, and whose output includes the outflow and overflow processes. The state and output processes are referred to as derived processes, since they are determined by the defining processes. Since the input sample functions (realizations) of $\{\alpha(t)\}$ and $\{\beta(t)\}$ are piecewise constant and right-continuous, the state trajectory $x(\theta; t)$ is piecewise linear and continuous in $t$, and the output function $\gamma(\theta; t)$ is piecewise constant. Moreover, the state trajectory can be decomposed into two kinds of intervals: empty periods and busy periods. Empty Periods (EP) are maximal intervals during which the buffer is empty, while Busy Periods (BP) are supremal intervals during which the buffer is nonempty. Observe that during an EP the system is not necessarily idle since the server may be active; see (B.7). Note also that since $x(\theta; t)$ is continuous in $t$, EPs are always closed intervals, whereas BPs are open intervals unless they contain one of the end points $0$ or $T$. The outflow process $\{\delta(t)\}$ becomes important in modeling networks of SFMs, but it will not concern us any further here, since our interest in this paper lies in single-node systems.
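
The dynamics (B.6) and the overflow rate (B.8) can be made concrete with a short sketch that integrates one sample path exactly, interval by interval. This is a minimal illustration under our own assumptions: the function name `simulate_sfm`, the list-based representation of the piecewise-constant rates, and the returned quantities (time integrals of the overflow rate and of the buffer content) are ours.

```python
def simulate_sfm(times, alphas, betas, theta, x0=0.0):
    """Integrate the SFM buffer dynamics for piecewise-constant rates.

    times  : breakpoints 0 = t_0 < t_1 < ... < t_N = T
    alphas : inflow rate on [t_i, t_{i+1});  betas : service rate on the same interval
    Returns (loss_volume, work, final_content), where loss_volume is the
    integral of the overflow rate and work is the integral of the buffer content.
    """
    x = x0
    loss = 0.0
    work = 0.0
    for i in range(len(times) - 1):
        dt = times[i + 1] - times[i]
        rate = alphas[i] - betas[i]
        if rate > 0:
            # Buffer rises linearly until it hits theta, then stays full and overflows.
            t_fill = min(dt, (theta - x) / rate)
            work += x * t_fill + 0.5 * rate * t_fill ** 2 + theta * (dt - t_fill)
            loss += rate * (dt - t_fill)
            x = min(theta, x + rate * t_fill)
        elif rate < 0:
            # Buffer drains linearly until it hits zero, then stays empty.
            t_empty = min(dt, x / (-rate))
            work += x * t_empty + 0.5 * rate * t_empty ** 2
            x = max(0.0, x + rate * t_empty)
        else:
            work += x * dt  # content is constant when inflow equals service rate
    return loss, work, x


# Small hand-checkable path: fill, partially drain, fill again.
loss, work, x_end = simulate_sfm(times=[0, 4, 6, 10],
                                 alphas=[2, 0, 2],
                                 betas=[1, 1, 1],
                                 theta=3.0)
print(loss, work, x_end)
```

Because the rates are constant between events, each interval contributes a triangle or trapezoid to the content integral, so no time discretization is needed.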


Let $\mathcal{L}(\theta) : \Theta \to \mathbb{R}$ be a random function defined over the underlying probability space $(\Omega, \mathcal{F}, P)$. Strictly speaking, we write $\mathcal{L}(\theta, \omega)$ to indicate that this sample function depends on the sample point $\omega \in \Omega$, but will suppress $\omega$ unless it is necessary to stress this fact. In what follows, we will consider two performance metrics, the Loss Volume $L_T(\theta)$ and the Cumulative Workload (or just Work) $Q_T(\theta)$, both defined on the interval $[0, T]$ via the following equations:

$$L_T(\theta) = \int_0^T \gamma(\theta; t)\, dt, \tag{B.9}$$

$$Q_T(\theta) = \int_0^T x(\theta; t)\, dt, \tag{B.10}$$

where, as already mentioned, we assume that $x(\theta; 0) = 0$. Observe that $\frac{1}{T} E[L_T(\theta)]$ is the Expected Loss Rate over the interval $[0, T]$, a common performance metric of interest (from which related metrics such as Loss Probability can also be derived). Similarly, $\frac{1}{T} E[Q_T(\theta)]$ is the Expected Buffer Content over $[0, T]$. We may then formulate optimization problems such as the determination of $\theta^*$ that minimizes a cost function of the form

$$J_T(\theta) = \frac{1}{T} E[Q_T(\theta)] + \frac{R}{T} E[L_T(\theta)] \equiv \frac{1}{T} J_Q(\theta) + \frac{R}{T} J_L(\theta),$$

where $R$ represents a rejection cost due to overflow. In order to accomplish this task, we rely on estimates of $dJ_L(\theta)/d\theta$ and $dJ_Q(\theta)/d\theta$ provided by the sample derivatives $dL_T(\theta)/d\theta$ and $dQ_T(\theta)/d\theta$ for use in stochastic gradient-based schemes. Accordingly, the objective of the next section is the estimation of the derivatives of $J_L(\theta)$ and $J_Q(\theta)$, which we will pursue through Infinitesimal Perturbation Analysis (IPA) techniques ([41], [20]). Henceforth we shall use the “prime” notation to denote derivatives with respect to $\theta$, and will proceed to estimate the derivatives $J_L'(\theta)$ and $J_Q'(\theta)$. The corresponding sample derivatives are denoted by $L_T'(\theta)$ and $Q_T'(\theta)$, respectively.

B.4  Infinitesimal  Perturbation  Analysis  (IPA)  with  respect 
to  Buffer  Size  or  Threshold 


As already mentioned, we will concentrate on the buffer size $\theta$ in the SFM described above or, equivalently, a threshold parameter used for buffer control. We assume that the processes $\{\alpha(t)\}$ and $\{\beta(t)\}$ are independent of $\theta$ and of the buffer content. Thus, we consider network settings operating with protocols such as ATM and UDP, but not TCP. Our objective is to estimate the derivatives $J_L'(\theta)$ and $J_Q'(\theta)$ through the sample derivatives $L_T'(\theta)$ and $Q_T'(\theta)$, which are commonly referred to as Infinitesimal Perturbation Analysis (IPA) estimators; comprehensive discussions of IPA and its applications can be found in [41], [20]. The IPA derivative-estimation technique computes $L_T'(\theta)$ and $Q_T'(\theta)$ along an observed sample path $\omega$. An IPA-based estimate $\mathcal{L}'(\theta)$ of a performance metric derivative $dE[\mathcal{L}(\theta)]/d\theta$ is unbiased if $dE[\mathcal{L}(\theta)]/d\theta = E[\mathcal{L}'(\theta)]$. Unbiasedness is the principal condition for making the application of IPA practical, since it enables the use of the sample (IPA) derivative in control and optimization methods that employ stochastic gradient-based techniques.

We consider sample paths of the SFM over $[0, T]$. For a fixed $\theta \in \Theta$, the interval $[0, T]$ is divided into alternating EPs and BPs. Suppose there are $K$ busy periods denoted by $B_k$, $k = 1, \ldots, K$, in increasing order. Then, by (B.9)-(B.10), the sample performance functions assume the following form:

$$L_T(\theta) = \sum_{k=1}^{K} \int_{B_k} \gamma(\theta; t)\, dt, \tag{B.11}$$

$$Q_T(\theta) = \sum_{k=1}^{K} \int_{B_k} x(\theta; t)\, dt. \tag{B.12}$$

As mentioned earlier, the processes $\{\alpha(t)\}$ and $\{\beta(t)\}$ are assumed piecewise constant. This implies that, w.p.1, there exist a random integer $N(T) \ge 0$ and an increasing sequence of time points $0 = t_0 < t_1 < \ldots < t_{N(T)} < t_{N(T)+1} = T$, generally dependent upon the sample path $\omega$, such that $t_i$ is a jump (discontinuity) point of $\alpha(t) - \beta(t)$; clearly, $\alpha(t) - \beta(t)$ is continuous at all points other than $t_0, \ldots, t_{N(T)}$. We will assume that $N(T)$ has a finite expectation, i.e., $E[N(T)] < \infty$.

Viewed as a discrete-event system, an event in a sample path of the SFM may be either exogenous or endogenous. An exogenous event is a jump in either $\{\alpha(t)\}$ or $\{\beta(t)\}$. An endogenous event is defined to occur when the buffer becomes full or empty. We note that the times at which the buffer ceases to be full or empty are locally independent of $\theta$, because they correspond to a change of sign in the difference function $\alpha(t) - \beta(t)$ (by a random function $f(\theta)$ being “locally independent” of $\theta$ we mean that for a given $\theta$ there exists $\Delta\theta > 0$ such that for every $\bar{\theta} \in (\theta - \Delta\theta, \theta + \Delta\theta)$, w.p.1 $f(\bar{\theta}) = f(\theta)$, where $\Delta\theta$ may depend on both $\theta$ and on the sample path). Thus, given a BP $B_k$, its starting point is one where the buffer ceases to be empty and is therefore locally independent of $\theta$, while its end point generally depends on $\theta$. Denoting these points by $\xi_k$ and $\eta_k(\theta)$ we express $B_k$ as

$$B_k = (\xi_k, \eta_k(\theta)), \quad k = 1, \ldots, K$$

for some random integer $K$. The BPs can be classified according to whether some overflow occurs during them or not. Thus, we define the random set

$$\Phi(\theta) := \{k \in \{1, \ldots, K\} : x(t) = \theta,\ \alpha(t) - \beta(t) > 0 \text{ for some } t \in (\xi_k, \eta_k(\theta))\}.$$

For every $k \in \Phi(\theta)$, there is a (random) number $M_k \ge 1$ of overflow periods in $B_k$, i.e., intervals during which the buffer is full and $\alpha(t) - \beta(t) > 0$. Let us denote these overflow periods by $\mathcal{F}_{k,m}$, $m = 1, \ldots, M_k$, in increasing order and express them as $\mathcal{F}_{k,m} = [u_{k,m}(\theta), v_{k,m}]$, $k = 1, \ldots, K$. Observe that the starting time $u_{k,m}(\theta)$ generally depends on $\theta$, whereas the ending time $v_{k,m}$ is locally independent of $\theta$, since it corresponds to a change of sign in the difference function $\alpha(t) - \beta(t)$, which has been assumed independent of $\theta$. Finally let

$$B(\theta) = |\Phi(\theta)| \tag{B.13}$$

where $|\cdot|$ denotes the cardinality of a set, i.e., $B(\theta)$ is the number of BPs in $[0, T]$ during which some overflow is observed. To summarize:

• There are $K$ busy periods in $[0, T]$, with $B_k = (\xi_k, \eta_k(\theta))$, $k = 1, \ldots, K$.

• $k \in \Phi(\theta)$ iff some overflow occurs during $B_k$; we set $B(\theta) = |\Phi(\theta)|$.

• For each $k \in \Phi(\theta)$, there are $M_k$ overflow periods in $B_k$, i.e., $\mathcal{F}_{k,m} = [u_{k,m}(\theta), v_{k,m}]$, $m = 1, \ldots, M_k$.

A typical sample path is shown in Fig. B.4, where $K = 3$, $\Phi(\theta) = \{1, 3\}$, $M_1 = 2$, $M_2 = 0$, $M_3 = 1$.

Figure B.4: A typical sample path of a SFM


As mentioned in the Introduction, we present two ways of deriving IPA estimators: (i) by evaluating the finite differences $\Delta L_T(\theta)$ and $\Delta Q_T(\theta)$ as functions of $\Delta\theta$, obtaining left and/or right sample derivatives (depending on whether $\Delta\theta < 0$ or $\Delta\theta > 0$), taking limits as $\Delta\theta \to 0$, and finally exploring if they yield unbiased estimates of $J_L'(\theta)$ and $J_Q'(\theta)$; or (ii) by explicitly evaluating $L_T'(\theta)$ and $Q_T'(\theta)$, which requires some additional technical assumptions. We will first proceed with the former approach and consider only the loss volume metric $L_T(\theta)$; the analysis for $Q_T(\theta)$ is similar, though a bit more involved (see also [15]). In pursuing this approach, we will also derive some results that will be used to establish the unbiasedness of the estimators $L_T'(\theta)$ and $Q_T'(\theta)$ obtained through the latter approach.


B.4.1  IPA  Using  Finite  Difference  Analysis 

The stochastic component of the SFM manifests itself in the duration of the intervals defined by exogenous event occurrences corresponding to jumps in either $\alpha(t)$ or $\beta(t)$. Let $\{A_i\}$, $i = 1, 2, \ldots$, be the point process defined by these exogenous event times. For convenience, let $\alpha_i$ and $\beta_i$ denote the (constant) inflow rate and service rate, respectively, over the interval $[A_i, A_{i+1})$. Note that we do not impose any restrictions on the probability law of the intervals defined by these events.

The main result of this section is to show that the sample derivative $L_T'(\theta)$, i.e., the sensitivity of the loss volume with respect to $\theta$, is given by $-B(\theta)$, and that this is an unbiased estimator of $J_L'(\theta)$. Recall that $B(\theta)$ is simply the count of busy periods in which at least one overflow period is observed. Moreover, this remarkably simple estimator is independent of any assumptions on the traffic process or service process, as well as of the rates involved and even $\theta$, i.e., it is nonparametric.

The starting point in IPA is to consider a nominal sample path under some buffer size (equivalently, admission threshold) $\theta$ and a perturbed sample path resulting from perturbing $\theta$ by $\Delta\theta$, while keeping the realizations of the processes $\{\alpha(t)\}$ and $\{\beta(t)\}$ unchanged, hence leaving $\{A_i\}$, $i = 1, 2, \ldots$, unchanged. For simplicity, we limit ourselves to the case where $\Delta\theta > 0$, leading to an estimate of the right sample derivative of $L_T(\theta)$; the case where $\Delta\theta < 0$ is similar, leading to an estimate of the left sample derivative of $L_T(\theta)$. We then define

$$\Delta x_i(\theta, \Delta\theta) = x_i(\theta + \Delta\theta) - x_i(\theta),$$

where $x_i(\theta)$ denotes the nominal sample buffer content at time $A_i$ and $x_i(\theta + \Delta\theta)$ denotes the perturbed sample buffer content at the same time. Similarly, we define perturbations for some additional sample path quantities as follows. First, setting $A_0 = 0$, let

$$L_i(\theta) = \int_{A_{i-1}}^{A_i} \gamma(\theta; t)\, dt, \quad i = 1, 2, \ldots \tag{B.14}$$

be the total loss volume observed over an interevent interval $[A_{i-1}, A_i)$, and define

$$\Delta L_i(\theta, \Delta\theta) = L_i(\theta + \Delta\theta) - L_i(\theta). \tag{B.15}$$

In addition, let

$$y_{i+1}(\theta) = x_i(\theta) + (\alpha_i - \beta_i)[A_{i+1} - A_i] \tag{B.16}$$

and note that $(\alpha_i - \beta_i)[A_{i+1} - A_i]$ is simply the amount of change in the buffer content from time $A_i$ to time $A_{i+1}$. Therefore, $y_{i+1}(\theta)$ is the queue content that would be obtained at time $A_{i+1}$ if the queue were allowed to become negative or to exceed $\theta$. We may then define

$$\Delta y_i(\theta, \Delta\theta) = y_i(\theta + \Delta\theta) - y_i(\theta).$$

Finally, we define a perturbation in the ending time of a BP as

$$\Delta\eta_k(\theta, \Delta\theta) = \eta_k(\theta + \Delta\theta) - \eta_k(\theta), \quad k = 1, 2, \ldots$$

For notational simplicity, we shall henceforth suppress the arguments of all quantities $\Delta x_i$, $\Delta y_i$, $\Delta L_i$, $\Delta\eta_k$.

Consider a typical BP, $B_k$, and all possible events that can take place in it, so as to determine how associated perturbations are either generated (due to $\Delta\theta$) or propagated from the previous event. The $k$th busy period is initiated by an exogenous event at time $\xi_k = A_i$, for some $i$, such that $\alpha_i - \beta_i > 0$, and let us assume that $\Delta x_i = 0$. Regarding the next exogenous event at time $A_{i+1}$ there are two possible cases to consider:

Case I: $y_{i+1}(\theta) \le \theta$. In this case, $y_{i+1}(\theta)$ is given by (B.16) and we have (see also Fig. B.5(a)):

$$x_{i+1}(\theta) = y_{i+1}(\theta), \qquad L_{i+1}(\theta) = 0.$$

Clearly, $\Delta x_{i+1} = \Delta y_{i+1} = \Delta L_{i+1} = 0$.

Case II: $y_{i+1}(\theta) > \theta$. In this case, the queue content in the perturbed path can increase beyond $\theta$ up to the perturbed value $\theta + \Delta\theta$. Then, as also seen in Fig. B.5(b),

$$\Delta x_{i+1} = \Delta\theta \tag{B.17}$$

$$\Delta L_{i+1} = -\Delta\theta \tag{B.18}$$

provided that $\Delta\theta$ is such that $0 < \Delta\theta \le y_{i+1}(\theta) - \theta$. To consider the case where $\Delta\theta > y_{i+1}(\theta) - \theta$, let the length of the overflow period in the nominal path be $F_i$ and note that

$$F_i = \frac{y_{i+1}(\theta) - \theta}{\alpha_i - \beta_i}.$$

Thus, if $\Delta\theta > y_{i+1}(\theta) - \theta = (\alpha_i - \beta_i)F_i$, then it is easy to see that the shaded area in Fig. B.5(b) reduces to a triangle with area $\frac{1}{2}(\alpha_i - \beta_i)F_i^2$. We then get

$$\Delta x_{i+1} = (\alpha_i - \beta_i)F_i \tag{B.19}$$

$$\Delta L_{i+1} = -(\alpha_i - \beta_i)F_i \tag{B.20}$$

Figure B.5: (a) Case I: No perturbation generation ($y_{i+1}(\theta) \le \theta$). (b) Case II: Perturbation generation for $0 < \Delta\theta \le y_{i+1}(\theta) - \theta$

Using the standard notation $[x]^+ = \max(x, 0)$, we can combine (B.17)-(B.18) with (B.19)-(B.20) to write

$$\Delta x_{i+1} = \Delta\theta - [\Delta\theta - (\alpha_i - \beta_i)F_i]^+ \tag{B.21}$$

$$\Delta L_{i+1} = -\Delta\theta + [\Delta\theta - (\alpha_i - \beta_i)F_i]^+ \tag{B.22}$$

Equations (B.21)-(B.22) capture the perturbation generation process due to $\Delta\theta$. The next step is to study how perturbations can be propagated, assuming the general situation $\Delta x_i \ge 0$. Doing so leads to the following result, which describes the complete queue content perturbation dynamics and establishes bounds for $\Delta x_i$.


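Lemma 4 For all $i = 1, 2, \ldots$,

$$0 \le \Delta x_i \le \Delta\theta \tag{B.23}$$

and

$$\Delta x_{i+1} = \begin{cases} [\Delta x_i - (\beta_i - \alpha_i) I_i]^+ & \text{if } \alpha_i - \beta_i < 0 \\ \Delta\theta - [\Delta\theta - \Delta x_i - (\alpha_i - \beta_i) F_i]^+ & \text{if } \alpha_i - \beta_i > 0 \end{cases} \tag{B.24}$$

where $I_i$ is the length of an EP ending at $A_{i+1}$ with $I_i = 0$ if no such period exists, and $F_i$ is the length of an overflow period ending at $A_{i+1}$ with $F_i = 0$ if no such period exists.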

Proof.  See  Appendix. 
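
The recursion (B.24), together with the loss perturbation generated according to (B.22), can be written down directly. The sketch below is only an illustration of those two formulas; the function names and argument names are ours, and the numerical values are arbitrary.

```python
def pos(v):
    """Standard notation [v]^+ = max(v, 0)."""
    return max(v, 0.0)


def next_content_perturbation(dx, alpha, beta, dtheta,
                              empty_len=0.0, overflow_len=0.0):
    """Buffer content perturbation at the next event, following (B.24).

    dx           : current perturbation (between 0 and dtheta)
    empty_len    : length I_i of an empty period ending at the next event (0 if none)
    overflow_len : length F_i of an overflow period ending at the next event (0 if none)
    """
    if alpha - beta < 0:
        return pos(dx - (beta - alpha) * empty_len)
    return dtheta - pos(dtheta - dx - (alpha - beta) * overflow_len)


def generated_loss_perturbation(alpha, beta, dtheta, overflow_len):
    """Loss perturbation generated when dx = 0, following (B.22)."""
    return -dtheta + pos(dtheta - (alpha - beta) * overflow_len)


dtheta = 0.1
# Long overflow period: the full perturbation dtheta is generated.
dx_full = next_content_perturbation(0.0, 2.0, 1.0, dtheta, overflow_len=0.5)
# Short overflow period: only (alpha - beta) * F_i is generated.
dx_part = next_content_perturbation(0.0, 2.0, 1.0, dtheta, overflow_len=0.05)
# A short empty period removes part of the perturbation, a long one removes it all.
dx_after_short_ep = next_content_perturbation(dtheta, 0.0, 1.0, dtheta, empty_len=0.04)
dx_after_long_ep = next_content_perturbation(dtheta, 0.0, 1.0, dtheta, empty_len=1.0)
dl_full = generated_loss_perturbation(2.0, 1.0, dtheta, overflow_len=0.5)
print(dx_full, dx_part, dx_after_short_ep, dx_after_long_ep, dl_full)
```

In every case the returned content perturbation stays between 0 and `dtheta`, which is the bound stated in (B.23).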
An immediate consequence of Lemma 4 is that a queue content perturbation may propagate across busy periods depending on the length of the EP separating these busy periods. This is because $\Delta x_{i+1} = [\Delta x_i - (\beta_i - \alpha_i) I_i]^+ \ge 0$ when an event occurs at time $A_{i+1}$ that ends an EP of length $I_i$. Moreover, recalling that the endpoints of busy periods are denoted by $\eta_k(\theta)$, $k = 1, 2, \ldots$, the perturbation in $\eta_k(\theta)$ can be easily obtained by noticing in Fig. B.8(a) (Case 1.2 in the proof of Lemma 4) that

$$\Delta\eta_k(\theta) = \frac{\Delta\theta}{\beta_i - \alpha_i}, \tag{B.25}$$

provided that $\Delta\theta < (\beta_i - \alpha_i) I_i$, where $\alpha_i$ and $\beta_i$ are the inflow rate and service rate at the time the BP ends. To account for the fact that the $k$th BP may contain an overflow interval of length $F_i$ with $\Delta\theta > (\alpha_i - \beta_i) F_i + \Delta x_i$, $\Delta\theta$ in (B.25) can be replaced by $\Delta\theta - [\Delta\theta - \Delta x_i - (\alpha_i - \beta_i) F_i]^+ \le \Delta\theta$ in view of (B.24). If, on the other hand, $\Delta\theta > (\beta_i - \alpha_i) I_i$, then the $k$th and $(k+1)$th busy periods are merged, which implies that $\Delta\eta_k(\theta)$ includes the entire length of the $(k+1)$th busy period.

Next, we identify bounds for $\Delta L_i$ (a generalization of the bounds for $\Delta x_i$ and $\Delta L_i$ can also be found in [80]).

Lemma 5 For all $i = 1, 2, \ldots$,

$$-\Delta\theta \le \Delta L_i \le 0 \tag{B.26}$$


Proof.  See  Appendix. 

Recall that if at least one overflow period is observed in the $k$th BP, then $k \in \Phi(\theta)$. Making use of the standard indicator function $\mathbf{1}[k \in \Phi(\theta)] = 1$ if $k \in \Phi(\theta)$ and zero otherwise, we have the following result, which allows us to characterize the cumulative loss perturbation at the end of a BP, which we will denote by $\Lambda_k(\Delta\theta)$, $k = 1, \ldots, K$.

Lemma 6 Consider a BP $B_k = (\xi_k, \eta_k(\theta))$ with $\xi_k = A_j$, $\Delta x_j = 0$, and $A_m < \eta_k(\theta) < A_{m+1}$. Assuming $\Delta\theta - \Delta x_i - (\alpha_i - \beta_i) F_i \le 0$ for all $i = j, \ldots, m$, the cumulative loss perturbation at the end of this busy period is

$$\Lambda_k(\Delta\theta) = -\Delta\theta\, \mathbf{1}[k \in \Phi(\theta)], \quad k = 1, \ldots, K \tag{B.27}$$

Proof.  See  Appendix. 

In simple terms, the loss perturbation depends only on the presence of an overflow within the observed busy period and not on the number of overflow periods. It is noteworthy that this perturbation does not explicitly depend on any values that $\alpha(t)$ or $\beta(t)$ may take or the nature of the stochastic processes involved. Considering Lemma 6, note that it allows us to analyze all busy periods separately and accumulate loss perturbations at the end of the sample path over all busy periods observed; this, however, is contingent on the fact that $\Delta x_i = 0$ when a BP starts with an exogenous event at $A_i$. On the other hand, we saw that a consequence of Lemma 4 is $\Delta x_{i+1} = [\Delta x_i - (\beta_i - \alpha_i) I_i]^+$ following an EP of length $I_i$, i.e., the buffer content perturbation may not be zero when a BP starts, depending on the length of the EP separating it from the preceding BP.
We can now derive an unbiased derivative estimate for our performance metric by establishing the following result.

Theorem 7 The (right) derivative of the Expected Loss, $E[L_T(\theta)]$, is given by

$$\frac{dE[L_T(\theta)]}{d\theta} = -E\left[\sum_{k=1}^{K} \mathbf{1}[k \in \Phi(\theta)]\right] = -E[B(\theta)] \tag{B.28}$$

where $K$ is the (random) number of busy periods contained in $[0, T]$, including a possibly incomplete last busy period.

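Proof. We have

$$\frac{dE[L_T(\theta)]}{d\theta} = \lim_{\Delta\theta \to 0} \frac{1}{\Delta\theta} E[\Delta L_T(\theta)] = \lim_{\Delta\theta \to 0} \frac{1}{\Delta\theta} E\left[\sum_{k=1}^{K} \Lambda_k(\Delta\theta)\right]$$

where $\Lambda_k(\Delta\theta) = -\Delta\theta\, \mathbf{1}[k \in \Phi(\theta)]$ from Lemma 6, provided $\Delta\theta - \Delta x_i - (\alpha_i - \beta_i) F_i \le 0$ for all $A_i \in [0, T]$. It follows that

$$\frac{dE[L_T(\theta)]}{d\theta} = -E\left[\sum_{k=1}^{K} \mathbf{1}[k \in \Phi(\theta)]\right] = -E[B(\theta)]$$

where we have used the definition in (B.13).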

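If $\Delta\theta - \Delta x_i - (\alpha_i - \beta_i) F_i > 0$ for some $A_i \in [0, T]$, then the only additional effect comes from $\Delta L_{i+1} = -(\alpha_i - \beta_i) F_i < 0$ in (B.53). Then, consider

$$E[-(\alpha_i - \beta_i) F_i \mid \Delta\theta - \Delta x_i > (\alpha_i - \beta_i) F_i] = \int_0^{\Delta\theta - \Delta x_i} -x f(x)\, dx$$

where $f(\cdot)$ is the conditional pdf of $(\alpha_i - \beta_i) F_i$ given $\Delta\theta - \Delta x_i > (\alpha_i - \beta_i) F_i$, and let $f(\cdot) \le c < \infty$. Recalling that $0 \le \Delta x_i \le \Delta\theta$ from Lemma 4, we get

$$\int_0^{\Delta\theta - \Delta x_i} -x f(x)\, dx \ge \int_0^{\Delta\theta} -x f(x)\, dx \ge \int_0^{\Delta\theta} -\Delta\theta f(x)\, dx \ge \Delta\theta \int_0^{\Delta\theta} -c\, dx = -c(\Delta\theta)^2$$

and it follows that

$$E[-(\alpha_i - \beta_i) F_i] \ge -c(\Delta\theta)^2 \tag{B.29}$$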
The cumulative loss perturbation due to events such that $\Delta L_{i+1} = -(\alpha_i - \beta_i) F_i$ is bounded from below by

$$\sum_{i=1}^{N(T)} -(\alpha_i - \beta_i) F_i,$$

where $F_i$ is the length of an overflow interval after the $i$th exogenous event, with $F_i = 0$ if no such overflow interval is present, and $N(T)$ is the total number of exogenous events in $[0, T]$. This cumulative loss perturbation is also bounded from above by $0$, since $\Delta L_i \le 0$ from Lemma 5. Using (B.29), we get, given some $N(T)$,

$$\lim_{\Delta\theta \to 0} \frac{1}{\Delta\theta} E\left[\sum_{i=1}^{N(T)} -(\alpha_i - \beta_i) F_i\right] \ge \lim_{\Delta\theta \to 0} \frac{1}{\Delta\theta} \sum_{i=1}^{N(T)} -c(\Delta\theta)^2 = \lim_{\Delta\theta \to 0} [-c\, \Delta\theta\, N(T)]$$

and

$$\lim_{\Delta\theta \to 0} [-c\, \Delta\theta]\, E[N(T)] = 0$$

where, by assumption, $E[N(T)] < \infty$. This completes the proof. ■


An immediate implication of this theorem is that $-B(\theta)$ is an unbiased estimator of $dE[L_T(\theta)]/d\theta$:

$$\left[\frac{dE[L_T(\theta)]}{d\theta}\right]_{est} = -B(\theta) \tag{B.30}$$


This  estimator  is  extremely  simple  to  implement:  (B.30)  is  merely  a  counter  of  all  busy 
periods  observed  in  [0,  T]  in  which  at  least  one  overflow  takes  place.  Again,  no  knowledge 
of  the  traffic  or  processing  rates  is  required,  nor  does  (B.30)  depend  on  the  nature  of  the 
random  processes  involved. 


Using the finite difference approach above, it is also possible to derive an unbiased estimator for $dE[Q_T(\theta)]/d\theta$ (see [15]), but it is considerably more tedious; we will see how to derive the same estimator in the next section by simpler means. Finally, note that (B.28) was derived using $\Delta\theta > 0$; thus, the analysis has to be repeated for $\Delta\theta < 0$ in order to evaluate the left sample derivative, and, although this does not present any conceptual difficulties, it adds to the tediousness of the finite difference analysis we have pursued thus far.


B.4.2  IPA  Using  Sample  Derivatives 

In this subsection, we derive explicitly the sample derivatives $L'_T(\theta)$ and $Q'_T(\theta)$ of the loss volume and work, defined in (B.11) and (B.12), respectively. We then show that they provide unbiased estimators of the expected loss volume sensitivity $dE[L_T(\theta)]/d\theta$ and the expected work sensitivity $dE[Q_T(\theta)]/d\theta$.

Since we are concerned with the sample derivatives $L'_T(\theta)$ and $Q'_T(\theta)$, we have to identify conditions under which they exist. Observe that any endogenous event time (a time point when the buffer becomes full or empty) is generally a function of $\theta$; see also (B.6). Denoting this point by $t(\theta)$, the derivative $t'(\theta)$ exists as long as $t(\theta)$ is not a jump point of the difference process $\{\alpha(t) - \beta(t)\}$. Recall that the times at which the buffer ceases to be full or empty are locally independent of $\theta$, because they correspond to a change of sign of the difference sample function $\alpha(t) - \beta(t)$, which does not depend on $\theta$. Excluding the possibility of the simultaneous occurrence of two events, the only situation preventing the existence of the sample derivatives $L'_T(\theta)$ and $Q'_T(\theta)$ involves an interval during which $x(t) = \theta$ and $\alpha(t) - \beta(t) = 0$, as seen in (B.8); in this case, the one-sided derivatives of $L_T(\theta)$ and $Q_T(\theta)$ exist and can be obtained with the approach of the previous section. In order to keep the analysis simple, we focus only on the differentiable case. Therefore, the analysis that follows rests on the following technical conditions:

Assumption 1.

a. W.p. 1, $\alpha(t) - \beta(t) \neq 0$.

b. For every $\theta \in \Theta$, w.p. 1, no two events may occur at the same time.

Remark. We stress the fact that the above conditions for ensuring the existence of the sample derivatives $L'_T(\theta)$ and $Q'_T(\theta)$ are very mild. Part b above is satisfied whenever the cdf's (or conditional cdf's) characterizing the intervals between exogenous event occurrences are continuous. For an example where it fails, consider the simple case where $\beta(t) = \beta$ and $\alpha(t)$ can only take two values, 0 and $\alpha > \beta$, and suppose that the inflow process switches from $\alpha$ to 0 after $\theta/(\alpha - \beta)$ time units w.p. 1. The buffer then becomes full exactly when an exogenous event occurs, and the loss volume sample function experiences a discontinuity w.p. 1. Such situations can only arise for a small finite subset of $\Theta$ (for which one can still calculate either the left or right derivatives) and they are of limited practical consequence.

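We next derive the IPA derivatives of $L_T(\theta)$ and $Q_T(\theta)$. Recall that $B(\theta) = |\Phi(\theta)|$, i.e., the number of BPs containing at least one overflow period.

Theorem 8 For every $\theta \in \Theta$,

$$L'_T(\theta) = -B(\theta). \tag{B.31}$$

Proof. Recalling that $\mathcal{B}_k = (\xi_k, \eta_k(\theta))$, we have, from (B.11),

$$L_T(\theta) = \sum_{k \in \Phi(\theta)} \int_{\xi_k}^{\eta_k(\theta)} \gamma(\theta; t)\,dt, \tag{B.32}$$

which after differentiation yields

$$L'_T(\theta) = \sum_{k \in \Phi(\theta)} \frac{d}{d\theta} \int_{\xi_k}^{\eta_k(\theta)} \gamma(\theta; t)\,dt. \tag{B.33}$$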


Note that the derivative in (B.33) is taken along a sample path. The set $\Phi(\theta)$, though depending on $\theta$, can be viewed as a constant for the purpose of taking the derivative. The reason is that, by virtue of Assumption 1b, it is locally independent of $\theta$, similarly to the endogenous event times discussed in the first part of Section 4 (i.e., for every fixed $\theta$, w.p. 1 there exists $\Delta\theta > 0$ such that, for every $\bar{\theta} \in [\theta - \Delta\theta, \theta + \Delta\theta]$, $\Phi(\bar{\theta}) = \Phi(\theta)$; although this $\Delta\theta$ generally depends on the given sample path, our derivative is taken along a specific sample path, hence (B.33) is justified).

Next, we focus on a particular $\mathcal{B}_k$ with $k \in \Phi(\theta)$ and we shall suppress the index $k$ to simplify the notation. Accordingly, the BP in question is denoted by $\mathcal{B} = (\xi, \eta(\theta))$, and there are $M \ge 1$ overflow periods in $\mathcal{B}$, denoted by $\mathcal{F}_m = [u_m(\theta), v_m]$, $m = 1, \ldots, M$. A typical scenario is depicted in Fig. B.4, where in the first BP we have $M = 2$. The loss volume over $\mathcal{B}$ is given by the function

$$\lambda(\theta) = \int_{\xi}^{\eta(\theta)} \gamma(\theta; t)\,dt. \tag{B.34}$$

We next prove that

$$\lambda'(\theta) = -1, \tag{B.35}$$

from which Eq. (B.31) immediately follows in view of (B.32)-(B.34). From the definition of $\gamma(\theta; t)$ in (B.8), we can rewrite (B.34) as

$$\lambda(\theta) = \sum_{m=1}^{M} \int_{u_m(\theta)}^{v_m} [\alpha(t) - \beta(t)]\,dt. \tag{B.36}$$

Since the points $u_m(\theta)$, $m = 1, \ldots, M$, and the jump points of $\alpha(t) - \beta(t)$ constitute events, and since w.p. 1 no two events can occur at the same time by Assumption 1b, the function $\alpha(t) - \beta(t)$ must be continuous w.p. 1 at the points $u_m(\theta)$, $m = 1, \ldots, M$. Consequently, by taking derivatives with respect to $\theta$ in (B.36) we obtain

$$\lambda'(\theta) = -\sum_{m=1}^{M} [\alpha(u_m(\theta)) - \beta(u_m(\theta))]\,u'_m(\theta). \tag{B.37}$$

Next, consider the individual terms in the above sum (see also Fig. B.4 for an illustration).
1. If $m = 1$, then the buffer is neither full nor empty in the interval $(\xi, u_1(\theta))$. Since the buffer content evolves from $x(\xi) = 0$ to $x(u_1(\theta)) = \theta$, (B.6) implies

$$\int_{\xi}^{u_1(\theta)} [\alpha(t) - \beta(t)]\,dt = \theta,$$

and, upon taking derivatives with respect to $\theta$,

$$[\alpha(u_1(\theta)) - \beta(u_1(\theta))]\,u'_1(\theta) = 1. \tag{B.38}$$

2. If $m > 1$, then the buffer is neither full nor empty in the interval $(v_{m-1}, u_m(\theta))$. Since $x(v_{m-1}) = x(u_m(\theta)) = \theta$ we obtain, by (B.6),

$$\int_{v_{m-1}}^{u_m(\theta)} [\alpha(t) - \beta(t)]\,dt = 0,$$

and upon differentiating with respect to $\theta$,

$$[\alpha(u_m(\theta)) - \beta(u_m(\theta))]\,u'_m(\theta) = 0. \tag{B.39}$$

Finally,  Eqs.  (B.37),  (B.38)  and  (B.39)  imply  (B.35),  which  immediately  implies  (B.31) 
and  the  proof  is  complete.  ■ 
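As a worked instance of the last step, consider a BP with $M = 2$ overflow periods, as in the first BP of Fig. B.4. Substituting (B.38) for the $m = 1$ term and (B.39) for the $m = 2$ term of (B.37) gives (B.35) directly, and summing over the BPs in $\Phi(\theta)$ recovers (B.31):

```latex
\lambda'(\theta)
  = -\Big( \underbrace{[\alpha(u_1(\theta)) - \beta(u_1(\theta))]\,u_1'(\theta)}_{=\,1 \text{ by (B.38)}}
         + \underbrace{[\alpha(u_2(\theta)) - \beta(u_2(\theta))]\,u_2'(\theta)}_{=\,0 \text{ by (B.39)}} \Big)
  = -1,
\qquad
L_T'(\theta) = \sum_{k \in \Phi(\theta)} \lambda_k'(\theta) = -|\Phi(\theta)| = -B(\theta).
```

Here $\lambda_k$ simply restores the BP index $k$ that the proof suppresses. Only the first overflow period in a BP contributes, which is why the estimator counts BPs rather than overflow periods.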

Note that Theorem 8 is consistent with Theorem 7. However, Theorem 7 includes a direct proof of the unbiasedness of the estimator $-B(\theta)$, whereas the present approach requires a separate proof that the sample derivative $L'_T(\theta) = -B(\theta)$ is in fact unbiased. The unbiasedness of this IPA derivative will be proven later, after we establish the IPA derivative of the work $Q_T(\theta)$ defined in (B.12).

Theorem 9 For every $\theta \in \Theta$,

$$Q'_T(\theta) = \sum_{k \in \Phi(\theta)} [\eta_k(\theta) - u_{k,1}(\theta)]. \tag{B.40}$$


Proof. We focus on a particular BP $\mathcal{B}_k$ with $k \in \Phi(\theta)$, and again suppress the notational dependency on $k$ for the sake of simplicity. Accordingly, consider a BP $\mathcal{B} = (\xi, \eta(\theta))$, and denote its overflow periods by $\mathcal{F}_m = [u_m(\theta), v_m]$, $m = 1, \ldots, M$, for some $M \ge 1$ (e.g., $M = 2$ in the first BP of Fig. B.4). Define the function

$$q(\theta) = \int_{\xi}^{\eta(\theta)} x(\theta; t)\,dt. \tag{B.41}$$

It suffices to prove that

$$q'(\theta) = \eta(\theta) - u_1(\theta) \tag{B.42}$$

since this would immediately imply (B.40). Since $x(\theta; t)$ is continuous in $t$, taking the derivative with respect to $\theta$ in (B.41) and letting $x'(\theta; t)$ denote the partial derivative with respect to $\theta$ yields

$$q'(\theta) = \int_{\xi}^{\eta(\theta)} x'(\theta; t)\,dt + x(\theta; \eta(\theta))\,\eta'(\theta) = \int_{\xi}^{\eta(\theta)} x'(\theta; t)\,dt, \tag{B.43}$$

since the BP ends at $\eta(\theta)$, hence $x(\theta; \eta(\theta)) = 0$. To evaluate this partial derivative (which exists at all $t$ except $t = u_m$ and $t = v_m$) we consider all possible cases regarding the location of $t$ in the BP $\mathcal{B} = (\xi, \eta(\theta))$ (see Fig. B.4):

1. $t \in (\xi, u_1(\theta))$. In this case, the buffer is neither empty nor full in this interval. It follows, using (B.6), that

$$x(\theta; t) = \int_{\xi}^{t} [\alpha(\tau) - \beta(\tau)]\,d\tau.$$

Since the right-hand side above is independent of $\theta$, we have $x'(\theta; t) = 0$.

2. $t \in (u_m(\theta), v_m)$, $m = 1, \ldots, M$. Since $(u_m(\theta), v_m)$ is an overflow period, $x(\theta; t) = \theta$ in these intervals, hence $x'(\theta; t) = 1$.

3. $t \in (v_m, u_{m+1}(\theta))$, $m = 1, \ldots, M - 1$. Here, the buffer is neither empty nor full in the interval $(v_m, t)$, while $x(\theta; v_m) = \theta$. It follows, using (B.6), that

$$x(\theta; t) = \theta + \int_{v_m}^{t} [\alpha(\tau) - \beta(\tau)]\,d\tau,$$

and upon differentiating with respect to $\theta$, we obtain $x'(\theta; t) = 1$.

4. $t \in (v_M, \eta(\theta))$. This case is identical to the previous one, yielding $x'(\theta; t) = 1$.

In summary, $x'(\theta; t) = 0$ for all $t \in (\xi, u_1(\theta))$ (Case 1), and $x'(\theta; t) = 1$ for all $t \in (u_1(\theta), \eta(\theta))$ (Cases 2-4). Therefore, it follows from (B.43) that (B.42) holds, implying (B.40) and completing the proof. ■

In simple terms, the contribution of a BP to the sample derivative $Q'_T(\theta)$ in (B.40) is the length of the interval defined by the first point at which the buffer becomes full and the end of the BP. Once again, as in (B.31), observe that the IPA derivative $Q'_T(\theta)$ is nonparametric, since it requires only the recording of times at which the buffer becomes full (i.e., $u_{k,1}(\theta)$) and empty (i.e., $\eta_k(\theta)$) for any $\mathcal{B}_k$ with $k \in \Phi(\theta)$. We also remark that the same IPA derivative can be obtained through the finite difference analysis of the previous section (see [15]), but with considerably more effort.
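The two sample derivatives (B.31) and (B.40) can be checked numerically on a small deterministic sample path. The sketch below is illustrative only: the function name `simulate_sfm`, the segment encoding, and the sample path `SEGMENTS` are assumptions introduced here, not part of the original analysis. It simulates a single fluid buffer with piecewise-constant net rate $\alpha - \beta$ and returns the loss volume, the work, and the two IPA derivatives.

```python
from typing import List, Tuple

def simulate_sfm(theta: float, segments: List[Tuple[float, float]]):
    """Simulate a fluid buffer of size theta that starts empty.

    segments holds (duration, net_rate) pairs with net_rate = alpha - beta,
    assumed nonzero (Assumption 1a). Returns (loss, work, dL, dQ), where
    dL = -B(theta) as in (B.31) and dQ = sum of (eta_k - u_{k,1}) as in (B.40).
    """
    x = t = 0.0                    # buffer content and current time
    loss = work = dQ = 0.0
    busy_with_overflow = 0         # B(theta)
    first_full = None              # u_{k,1}: first time the buffer fills in this BP
    for d, r in segments:
        end = t + d
        if r > 0:
            t_full = t + (theta - x) / r
            if t_full < end:       # buffer fills, then overflows until `end`
                work += 0.5 * (x + theta) * (t_full - t) + theta * (end - t_full)
                loss += r * (end - t_full)
                if first_full is None:
                    first_full = t_full
                x = theta
            else:                  # buffer rises without reaching theta
                work += (x + 0.5 * r * d) * d
                x += r * d
        else:
            t_empty = t + x / (-r)
            if t_empty <= end:     # buffer empties: the busy period ends
                work += 0.5 * x * (t_empty - t)
                x = 0.0
                if first_full is not None:
                    busy_with_overflow += 1
                    dQ += t_empty - first_full
                    first_full = None
            else:                  # buffer drains without emptying
                work += (x + 0.5 * r * d) * d
                x += r * d
        t = end
    if first_full is not None:     # possibly incomplete last busy period
        busy_with_overflow += 1
        dQ += t - first_full
    return loss, work, -busy_with_overflow, dQ

# Hypothetical sample path: three busy periods, the last one without overflow.
SEGMENTS = [(2.0, 1.0), (1.0, -0.5), (2.0, 1.0), (10.0, -1.0),
            (3.0, 2.0), (5.0, -1.0), (0.5, 1.0), (5.0, -1.0)]
```

With `theta = 1.5`, the first BP contains two overflow periods and the second contains one, so the IPA derivatives are $-2$ and $5.0 + 3.75 = 8.75$. Comparing them with the finite differences of `loss` and `work` for a small increment of `theta` gives agreement, up to a second-order term in the case of the work.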


IPA  Unbiasedness 

We next prove the unbiasedness of the IPA derivatives $L'_T(\theta)$ and $Q'_T(\theta)$ obtained above. Although we have already shown in (B.28) that $-B(\theta)$ is an unbiased estimate of $dE[L_T(\theta)]/d\theta$, we supply an alternative and greatly simplified proof based on the direct derivation of the IPA estimator in this section and on some of the results of the finite-difference analysis in Section 4.1. By a similar technique, we also supply a proof of the unbiasedness of the IPA estimator $Q'_T(\theta)$ in (B.40). These proofs, jointly with the sample-derivative technique for obtaining the estimators, suggest the possibility of extensive generalizations to the functional forms of $\alpha(t)$ and $\beta(t)$ (beyond piecewise constant), to be explored in a forthcoming paper (also, see [78], [79]).

In general, the unbiasedness of an IPA derivative $\mathcal{L}'(\theta)$ has been shown to be ensured by the following two conditions (see [71], Lemma A2, p. 70):

Condition 1. For every $\theta \in \Theta$, the sample derivative $\mathcal{L}'(\theta)$ exists w.p. 1.

Condition 2. W.p. 1, the random function $\mathcal{L}(\theta)$ is Lipschitz continuous throughout $\Theta$, and the (generally random) Lipschitz constant has a finite first moment.

Consequently, establishing the unbiasedness of $L'_T(\theta)$ and $Q'_T(\theta)$ as estimators of $dE[L_T(\theta)]/d\theta$ and $dE[Q_T(\theta)]/d\theta$, respectively, reduces to verifying the Lipschitz continuity of $L_T(\theta)$ and $Q_T(\theta)$ with appropriate Lipschitz constants. Recall that $N(T)$ is the random number of all exogenous events in $[0, T]$ and that we have assumed $E[N(T)] < \infty$.

Theorem 10 Under Assumption 1,

1. If $E[N(T)] < \infty$, then the IPA derivative $L'_T(\theta)$ is an unbiased estimator of $dE[L_T(\theta)]/d\theta$.

2. The IPA derivative $Q'_T(\theta)$ is an unbiased estimator of $dE[Q_T(\theta)]/d\theta$.

Proof. Under Assumption 1, Condition 1 holds for $L_T(\theta)$ and $Q_T(\theta)$. Therefore, it only remains to establish Condition 2.

First, consider $L_T(\theta)$. Recalling (B.14) and (B.15), we can write

$$\Delta L_T(\theta) = \sum_{i=1}^{N(T)} \Delta L_i$$

by partitioning $[0, T]$ into intervals $[\lambda_{i-1}, \lambda_i)$ defined by successive exogenous events. Then, by Lemma 5, $-\Delta\theta \le \Delta L_i \le 0$, so that

$$|\Delta L_T(\theta)| \le N(T)\,|\Delta\theta|,$$

i.e., $L_T(\theta)$ is Lipschitz continuous with constant $N(T)$. Since $E[N(T)] < \infty$, this establishes unbiasedness.

Consider next the sample function $Q_T(\theta)$, defined by (B.12), and fix $\theta$ and $\Delta\theta > 0$. By Lemma 4, $0 \le \Delta x_i \le \Delta\theta$, hence the difference $\Delta x(\theta, \Delta\theta; t) := x(\theta + \Delta\theta; t) - x(\theta; t)$ satisfies the inequalities

$$0 \le \Delta x(\theta, \Delta\theta; t) \le \Delta\theta.$$

Consequently, in view of (B.12),

$$|\Delta Q_T(\theta)| = \left|\int_0^T \Delta x(\theta, \Delta\theta; t)\,dt\right| \le T\,|\Delta\theta|,$$

that is, $Q_T(\theta)$ is Lipschitz continuous with constant $T$. This completes the proof. ■

Remark. For the more commonly used performance metrics $\frac{1}{T}E[L_T(\theta)]$ (the Expected Loss Rate over $[0, T]$) and $\frac{1}{T}E[Q_T(\theta)]$ (the Expected Buffer Content over $[0, T]$), the Lipschitz constants in Theorem 10 become $N(T)/T$ and 1, respectively. As $T \to \infty$, the former quantity typically converges to the exogenous event rate.


B.5  Optimal Buffer Control Using SFM-Based IPA Estimators


As suggested in Section 2 and illustrated in Fig. B.2, the solution to an optimization problem defined for an actual network node (i.e., a node that operates as a queueing system) may be accurately approximated by the solution to the same problem based on an SFM of the node. However, this may not always be the case. On the other hand, the simple form of the IPA estimators of the Expected Loss Rate and Expected Buffer Content obtained through (B.31) and (B.40) allows us to use data from the actual (real-world) system in order to estimate sensitivities that, in turn, may be used to solve an optimization problem of interest. In other words, the form of the IPA estimators is obtained by analyzing the system as an SFM, but the associated values are based on real data. In particular, an algorithm for implementing the estimators (B.31) and (B.40) is given below:

IPA Estimation Algorithm

•  Initialize a counter $\mathcal{C} := 0$ and a cumulative timer $\mathcal{T} := 0$.

•  Initialize $\tau := 0$.

•  If an overflow event is observed at time $t$ and $\tau = 0$:

   -  Set $\tau := t$.

•  If a busy period ends at time $t$ and $\tau > 0$:

   -  Set $\mathcal{C} := \mathcal{C} - 1$ and $\mathcal{T} := \mathcal{T} + (t - \tau)$.

   -  Reset $\tau := 0$.

•  If $t = T$ and $\tau > 0$:

   -  Set $\mathcal{C} := \mathcal{C} - 1$ and $\mathcal{T} := \mathcal{T} + (t - \tau)$.

Here $\tau$ holds the time of the first overflow in the current busy period, and $\tau = 0$ indicates that no overflow has yet been observed in it. The final values of $\mathcal{C}$ and $\mathcal{T}$ provide the IPA derivatives $L'_T(\theta)$ and $Q'_T(\theta)$, respectively. We remark that the "overflow" and "end of BP" events are readily observable during actual network operation. In addition, we point out once again that these estimates are independent of all underlying stochastic features, including traffic and processing rates. Finally, the algorithm is easily modified to apply to any interval $[T_1, T_2]$.
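The algorithm above can be sketched in a few lines operating on a log of observed events. This is a minimal illustration under stated assumptions: the function name `ipa_estimates`, the event labels `"overflow"` and `"bp_end"`, and the time-ordered `(time, kind)` log format are hypothetical choices made here, and `None` is used in place of the sentinel value 0 for the first-overflow time.

```python
from typing import Iterable, Optional, Tuple

def ipa_estimates(events: Iterable[Tuple[float, str]],
                  horizon: float) -> Tuple[int, float]:
    """Return (counter, timer) after processing events observed in [0, horizon].

    events: time-ordered (time, kind) pairs, kind being "overflow" or "bp_end"
            (hypothetical labels for the two observable event types).
    counter estimates the loss-volume derivative, timer the work derivative.
    """
    counter = 0                              # decremented once per BP with overflow
    timer = 0.0                              # accumulates (BP end - first overflow)
    first_overflow: Optional[float] = None   # plays the role of tau
    for t, kind in events:
        if t > horizon:
            break
        if kind == "overflow" and first_overflow is None:
            first_overflow = t               # first overflow of the current BP
        elif kind == "bp_end" and first_overflow is not None:
            counter -= 1
            timer += t - first_overflow
            first_overflow = None            # reset for the next BP
    if first_overflow is not None:           # BP still in progress at the horizon
        counter -= 1
        timer += horizon - first_overflow
    return counter, timer
```

For instance, the log `[(1.5, "overflow"), (3.5, "overflow"), (6.5, "bp_end")]` with any horizon beyond 6.5 contributes a single decrement of the counter and 5.0 time units to the timer, since only the first overflow of a busy period is recorded.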

Let us now return to the buffer control problem presented in Section 2, where the objective was to determine a threshold $C$ that minimizes a cost function of the form

$$J_T(C) = Q_T(C) + R \cdot L_T(C)$$

trading off the expected queue length against the expected loss rate, weighted by a rejection penalty $R$. If an SFM is used instead, then the cost function of interest becomes

$$J_T(\theta) = \frac{1}{T} E[Q_T(\theta)] + \frac{R}{T} E[L_T(\theta)]$$

and the optimal threshold parameter, $\theta^*$, may be determined through a standard stochastic approximation algorithm based on (B.4). The gradient estimator $H_n(\theta, \omega_n^{SFM})$ is the IPA estimator of $dJ_T/d\theta$ based on (B.31) and (B.40):

$$H_n(\theta, \omega_n^{SFM}) = \frac{1}{T} \sum_{k \in \Phi(\theta)} [\eta_k(\theta) - u_{k,1}(\theta)] - \frac{R}{T} B(\theta) \tag{B.44}$$

evaluated over a simulated sample path $\omega_n^{SFM}$ of length $T$, following which a control update is performed through (B.4) based on the value of $H_n(\theta, \omega_n^{SFM})$.

The interesting observation here is that the same estimator may be used in (B.5) as follows: if a packet arrives and is rejected, the time this occurs is recorded as $\tau$ in the algorithm above. At the end of the current busy period, the counter $\mathcal{C}$ and timer $\mathcal{T}$ are updated. Thus, the exact same expression as in the right-hand side of (B.44) can be used to update the threshold:

$$C_{n+1} = C_n - \nu_n H_n(C_n, \omega_n^{DES}), \quad n = 0, 1, \ldots \tag{B.45}$$

Note that, after a control update, the state must be reset to 0, in accordance with our convention that all performance metrics are defined over an interval $[0, T]$ with an initially empty buffer. In the case of off-line control, this simply amounts to simulating the system after resetting its state to 0. In the more interesting case of on-line control, we proceed as follows. Suppose that the $n$th iteration ends at time $\tau_n$ and the state is $x(C_n; \tau_n)$ (in general, $x(C_n; \tau_n) > 0$). At this point, the threshold is updated and its new value is $C_{n+1}$. Let $\tau_n^0 > \tau_n$ be the next time that the buffer is empty, i.e., $x(C_{n+1}; \tau_n^0) = 0$. At this point, the $(n+1)$th iteration starts and the next gradient estimate is obtained over the interval $[\tau_n^0, \tau_n^0 + T]$, so that $\tau_{n+1} = \tau_n^0 + T$ and the process repeats. The implication is that over the interval $[\tau_n, \tau_n^0]$ no estimation is carried out while the controller waits for the system to be reset to its proper initial state; therefore, sample path information available over $[\tau_n, \tau_n^0]$ is effectively wasted as far as gradient estimation is concerned.
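The update (B.45) with a constant step size can be sketched as a simple loop. The code below is a hedged illustration, not the scheme used in the experiments: the names `optimize_threshold` and `toy_gradient` are hypothetical, the projection onto `[c_min, c_max]` is an added safeguard that is not part of (B.45), and the toy gradient is a deterministic stand-in for the IPA estimate (B.44), which in practice would be computed from one observation interval of length $T$ per iteration.

```python
from typing import Callable, List

def optimize_threshold(grad_estimate: Callable[[float], float],
                       c0: float, step: float, iterations: int,
                       c_min: float = 0.0,
                       c_max: float = float("inf")) -> List[float]:
    """Iterate c_{n+1} = c_n - step * H_n(c_n), as in (B.45).

    grad_estimate(c) should return one gradient estimate H_n(c), e.g. the
    IPA estimate (B.44) observed over an interval that starts with an empty
    buffer. The clamp to [c_min, c_max] is an assumption added for safety.
    """
    history = [c0]
    c = c0
    for _ in range(iterations):
        h = grad_estimate(c)                       # one estimate per iteration
        c = min(max(c - step * h, c_min), c_max)   # gradient step, then clamp
        history.append(c)
    return history

def toy_gradient(c: float) -> float:
    """Hypothetical stand-in: gradient of J(c) = 0.05 * (c - 7)^2."""
    return 0.1 * (c - 7.0)
```

Calling `optimize_threshold(toy_gradient, 20.0, 5.0, 30)` drives the threshold toward the minimizer of the toy cost, halving the distance at each step. With a noisy IPA estimate in place of `toy_gradient`, a constant step keeps the iterates fluctuating around the optimum rather than converging exactly.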

Figure B.6 depicts examples of the application of this scheme to a single-node SFM under six different parameter settings (scenarios), summarized in Table B.1. As in Fig. B.2, 'DES' denotes curves obtained by estimating $J_T(C)$ over different (discrete) values of $C$, 'SFM' denotes curves obtained by estimating $J_T(\theta)$ over different values of $\theta$, and 'Opt. Algo.' represents the optimization process (B.45), where we maintain real-valued thresholds throughout. The first three scenarios correspond to a high traffic intensity $\rho$ compared to the remaining three. For each example, $C^*$ is the optimal threshold obtained through exhaustive simulation. In all simulations, an ON-OFF traffic source is used with the number of arrivals in each ON period geometrically distributed with parameter $p$ and arrival rate $\alpha$; the OFF period is exponentially distributed with parameter $\mu$; and the service rate is fixed at $\beta$. Thus, the traffic intensity of the system is $\rho = \alpha \frac{1}{\alpha p} \big/ \left[\beta\left(\frac{1}{\alpha p} + \frac{1}{\mu}\right)\right]$, where $\frac{1}{\alpha p}$ is the average length of an ON period and $\frac{1}{\mu}$ is the average length of an OFF period. The rejection cost is $R = 50$. For simplicity, $\nu_n$ in (B.45) is taken to be a constant, $\nu_n = 5$. Finally, in all cases $T = 100{,}000$. As seen in Fig. B.6, the threshold value obtained through (B.45) using the SFM-based gradient estimator in (B.44) either recovers $C^*$ or is close to it with a cost value extremely close to $J_T(C^*)$; since in some cases the cost function is nearly constant in the neighborhood of the optimum, it is difficult to determine the actual optimal threshold, but it is also practically unimportant since the cost is essentially the same. We have also implemented (B.45) with $H_n(C_n, \omega_n^{DES})$ estimated over shorter interval lengths $T = 10{,}000$ and $T = 5{,}000$, with virtually identical results. Looking at Fig. B.6, it is worth observing that determining $\theta^*$ as an approximation to $C^*$ through off-line analysis of the SFM would also yield good approximations, further supporting the premise of this paper that SFMs provide an attractive modeling framework for control and optimization (not just performance analysis) of complex networks.

Scenario   ρ      α    p      μ      β       C*
1          0.99   1    0.1    0.1    0.505   7
2          0.99   1    0.05   0.05   0.505   7
3          0.99   2    0.05   0.1    1.01    15
4          0.71   1    0.1    0.1    0.7     13
5          0.71   1    0.05   0.05   0.7     11
6          0.71   2    0.05   0.1    1.4     22

Table B.1: Parameter settings for six examples
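The traffic intensities listed for the six scenarios can be re-derived from the other parameters. The short sketch below assumes the reading of the formula given in the text, namely a mean ON period of $1/(\alpha p)$ and a mean OFF period of $1/\mu$; the function name `traffic_intensity` and the dictionary `SCENARIOS` are introduced here only for illustration.

```python
def traffic_intensity(alpha: float, p: float, mu: float, beta: float) -> float:
    """Traffic intensity of an ON-OFF source feeding a server of rate beta.

    Mean ON length is 1/(alpha*p): a geometric number of arrivals with mean 1/p,
    generated at rate alpha. Mean OFF length is 1/mu.
    """
    mean_on = 1.0 / (alpha * p)
    mean_off = 1.0 / mu
    return alpha * mean_on / (beta * (mean_on + mean_off))

# (alpha, p, mu, beta) for the six scenarios of the parameter table.
SCENARIOS = {
    1: (1, 0.1, 0.1, 0.505),
    2: (1, 0.05, 0.05, 0.505),
    3: (2, 0.05, 0.1, 1.01),
    4: (1, 0.1, 0.1, 0.7),
    5: (1, 0.05, 0.05, 0.7),
    6: (2, 0.05, 0.1, 1.4),
}

RHO = {k: traffic_intensity(*v) for k, v in SCENARIOS.items()}
```

Rounded to two decimals, the computed values reproduce the tabulated intensities: 0.99 for scenarios 1-3 and 0.71 for scenarios 4-6.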


B.6  Conclusions  and  Future  Work 


Stochastic Fluid Models (SFMs) can adequately describe the dynamics of high-speed communication networks, where they may be used to approximate discrete event models or constitute primary models in their own right. When control and optimization are of primary importance (rather than performance analysis), an SFM may be used as a means for accurately determining an optimal parameter setting, even though the corresponding performance evaluated through the SFM may not be particularly accurate. With this premise in mind, we have considered single-node SFMs from the standpoint of IPA derivative estimation. In particular, we have developed IPA estimators for the loss volume and work as functions of the buffer size, and shown them to be unbiased and nonparametric. The simplicity of the estimators and their nonparametric property suggest their application to on-line network management. Indeed, for a class of buffer control problems, we have shown how to use an optimization scheme (and illustrated it through numerical examples) for a discrete event model (viewed as a real, queueing-based single-node system) using the IPA gradient obtained from its SFM counterpart. Interestingly, there is no IPA derivative for the discrete event model, since its associated control parameter is discrete.

For  the  loss  volume  performance  function,  the  IPA  derivative  has  been  developed  by  two 
separate  techniques:  finite  difference  analysis,  and  a  sample  derivative  analysis.  The  former 
method  is  more  elaborate,  but  sheds  light  on  the  structure  of  the  derivative  estimator.  The 
second  method  is  more  direct  and  elegant,  but  its  unbiasedness  proof  requires  some  results 
obtained  by  the  analysis  of  the  former  method.  The  sample-derivative  method  was  also 
applied  to  the  IPA  estimator  of  the  buffer  workload  performance  function. 

The sample derivative analysis holds the promise of considerable extensions to multiple SFMs as models of actual networks and to multiple flow classes that can be used for differentiating traffic classes with different Quality-of-Service (QoS) requirements. Ongoing research has already led to very encouraging results, reported in [15], involving IPA estimators and associated optimization for flow control purposes in multi-node models. Finally, for the purpose of session-by-session admission control, preliminary work suggests that one can use sensitivity information with respect to inflow rates (which can be obtained through an approach similar to the one presented in this paper) and contribute to the development of effective algorithms, yet to be explored.

Figure B.6: Optimal threshold determination in an actual system using SFM-based gradient estimators - Scenarios 1-6


Appendix 

Proof of Lemma 4: Looking at any segment of the sample path over an interval $[\lambda_i, \lambda_{i+1})$, there are two possibilities: either $\alpha_i - \beta_i < 0$ or $\alpha_i - \beta_i > 0$. First, suppose that $\alpha_i - \beta_i < 0$ and consider the event which occurs at time $\lambda_{i+1}$. There are three cases to analyze:

Case 1.1: $y_{i+1}(\theta) > 0$. In this case, as seen in Fig. B.7, we have:

$$\Delta y_{i+1} = \Delta x_{i+1} = \Delta x_i \tag{B.46}$$

Case 1.2: $y_{i+1}(\theta) < 0$ and $y_{i+1}(\theta) + \Delta y_{i+1} < 0$. In this case, as seen in Fig. B.8(a), the $k$th BP ends and it is followed by an EP of length $I_i$, which in turn ends at time $\lambda_{i+1}$. Clearly,

$$\Delta x_{i+1} = 0 \tag{B.47}$$

Case 1.3: $y_{i+1}(\theta) < 0$ and $y_{i+1}(\theta) + \Delta y_{i+1} > 0$. This represents a situation where an EP of length $I_i$ is eliminated in the perturbed path, i.e., $I_i < \Delta x_i/(\beta_i - \alpha_i)$. As seen in Fig. B.8(b), the buffer content perturbation becomes

$$\Delta x_{i+1} = \Delta x_i - (\beta_i - \alpha_i) I_i. \tag{B.48}$$

Figure B.7: Case 1.1: $y_{i+1}(\theta) > 0$

Figure B.8: (a) Case 1.2: $y_{i+1}(\theta) < 0$ and $y_{i+1}(\theta) + \Delta y_{i+1} < 0$. (b) Case 1.3: $y_{i+1}(\theta) < 0$ and $y_{i+1}(\theta) + \Delta y_{i+1} > 0$

Next, let us assume that $\alpha_i - \beta_i > 0$. We then have three cases as follows:

Case 2.1: $y_{i+1}(\theta) < \theta$ and $y_{i+1}(\theta) + \Delta y_{i+1} < \theta$. It is easy to see (Fig. B.9(a)) that this is identical to Case 1.1 yielding (B.46).

Case 2.2: $y_{i+1}(\theta) < \theta$ and $y_{i+1}(\theta) + \Delta y_{i+1} > \theta$. The perturbed buffer content cannot exceed $\theta + \Delta\theta$, since $\Delta y_{i+1} = \Delta x_i \le \Delta\theta$ from (B.21); therefore, $y_{i+1} + \Delta y_{i+1} < \theta + \Delta\theta$ and the situation is identical to that of Fig. B.9(a), again yielding (B.46).

Case 2.3: $y_{i+1}(\theta) > \theta$. As seen in Fig. B.9(b),

$$\Delta x_{i+1} = \Delta\theta$$

as in Case II where perturbation generation was considered. Once again, however, it is possible that $\Delta\theta > (\alpha_i - \beta_i)F_i + \Delta x_i$, so that we write, similarly to Case II,

$$\Delta x_{i+1} = \Delta\theta - [\Delta\theta - \Delta x_i - (\alpha_i - \beta_i)F_i]^+. \tag{B.49}$$

Figure B.9: (a) Cases 2.1-2.2: $y_{i+1}(\theta) < \theta$. (b) Case 2.3: $y_{i+1}(\theta) > \theta$

We may now establish (B.23) by combining (B.46), (B.47), (B.48), and (B.49) and by observing that (i) in (B.48), $I_i < \Delta x_i/(\beta_i - \alpha_i)$ with $\beta_i - \alpha_i > 0$, therefore $0 < \Delta x_{i+1} < \Delta x_i$, and (ii) in (B.49), $0 < \Delta x_{i+1} \le \Delta\theta$.

Next, by combining (B.46), (B.47), (B.48) we obtain the first part of (B.24), observing that $I_i = 0$ in Case 1.1. To obtain the second part, we combine (B.46) and (B.49), observing that when $F_i = 0$ in (B.49), we get $\Delta x_{i+1} = \Delta x_i$, since $\Delta\theta - \Delta x_i \ge 0$ from (B.23), which reduces to (B.46) corresponding to Cases 2.1-2.2. ■

Proof of Lemma 5: Proceeding as in the proof of Lemma 4, we first consider the case $\alpha_i - \beta_i < 0$ and get:

Case 1.1: $y_{i+1}(\theta) > 0$. In this case, as seen in Fig. B.7, we have:

$$\Delta L_{i+1} = 0 \tag{B.50}$$

Case 1.2: $y_{i+1}(\theta) < 0$ and $y_{i+1}(\theta) + \Delta y_{i+1} < 0$. Clearly, as seen in Fig. B.8(a),

$$\Delta L_{i+1} = 0 \tag{B.51}$$

Case 1.3: $y_{i+1}(\theta) < 0$ and $y_{i+1}(\theta) + \Delta y_{i+1} > 0$. This represents a situation where an EP of length $I_i$ is eliminated in the perturbed path, i.e., $I_i < \Delta x_i/(\beta_i - \alpha_i)$. As seen in Fig. B.8(b), no loss is involved in either path:

$$\Delta L_{i+1} = 0. \tag{B.52}$$

Next, let us assume that $\alpha_i - \beta_i > 0$ and we have:

Case 2.1: $y_{i+1}(\theta) < \theta$ and $y_{i+1}(\theta) + \Delta y_{i+1} < \theta$. It is easy to see (Fig. B.9(a)) that this is identical to Case 1.1 yielding (B.50).

Case 2.2: $y_{i+1}(\theta) < \theta$ and $y_{i+1}(\theta) + \Delta y_{i+1} > \theta$. As argued in the proof of Lemma 4, the situation is identical to that of Fig. B.9(a), again yielding (B.50).

Case 2.3: $y_{i+1}(\theta) > \theta$. If $\Delta\theta > (\alpha_i - \beta_i)F_i + \Delta x_i$, then $\Delta L_{i+1} = 0 - (y_{i+1} - \theta) = -(\alpha_i - \beta_i)F_i$. Otherwise, $L_{i+1}(\theta + \Delta\theta) = y_{i+1} + \Delta x_i - \theta - \Delta\theta$, and we get $\Delta L_{i+1} = \Delta x_i - \Delta\theta$. Thus,

$$\Delta L_{i+1} = (\Delta x_i - \Delta\theta) + [\Delta\theta - \Delta x_i - (\alpha_i - \beta_i)F_i]^+. \tag{B.53}$$

We may now combine (B.50), (B.51), (B.52), (B.53). Observe that in (B.53) $\Delta L_{i+1} \ge -\Delta\theta$, since we have already established that $\Delta x_i \ge 0$ in Lemma 4. Moreover, $\Delta L_{i+1} = -(\alpha_i - \beta_i)F_i < 0$ if $\Delta\theta - \Delta x_i - (\alpha_i - \beta_i)F_i > 0$, and $\Delta L_{i+1} = \Delta x_i - \Delta\theta$ if $\Delta\theta - \Delta x_i - (\alpha_i - \beta_i)F_i \le 0$, where $\Delta x_i - \Delta\theta \le 0$ from Lemma 4. This yields (B.26). ■
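The compact $[\cdot]^+$ forms in (B.49) and (B.53) for Case 2.3 can be checked against the two sub-cases they encode. The sketch below is only an illustration; the function names `pos`, `dx_next`, and `dL_next` and the numerical values used to exercise them are assumptions introduced here.

```python
def pos(z: float) -> float:
    """Positive part [z]^+ = max(z, 0)."""
    return max(z, 0.0)

def dx_next(d_theta: float, dx_i: float, net_rate: float, F_i: float) -> float:
    """Buffer content perturbation after an overflow interval, as in (B.49).

    net_rate stands for alpha_i - beta_i > 0 and F_i for the overflow length.
    """
    return d_theta - pos(d_theta - dx_i - net_rate * F_i)

def dL_next(d_theta: float, dx_i: float, net_rate: float, F_i: float) -> float:
    """Loss volume perturbation after an overflow interval, as in (B.53)."""
    return (dx_i - d_theta) + pos(d_theta - dx_i - net_rate * F_i)
```

When the overflow interval is long enough that the bracket is nonpositive, the pair reduces to $\Delta x_{i+1} = \Delta\theta$ and $\Delta L_{i+1} = \Delta x_i - \Delta\theta$. When the overflow interval is short enough to vanish in the perturbed path, it reduces to $\Delta x_{i+1} = \Delta x_i + (\alpha_i - \beta_i)F_i$ and $\Delta L_{i+1} = -(\alpha_i - \beta_i)F_i$. In both cases the bounds $0 \le \Delta x_{i+1} \le \Delta\theta$ and $-\Delta\theta \le \Delta L_{i+1} \le 0$ hold.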

Proof of Lemma 6: Proceeding as in the proof of Lemma 5, we get $\Delta L_{i+1} = 0$ in (B.50), (B.51), (B.52), i.e., in all cases except Case 2.3 where (B.53) applies:

$$\Delta L_{i+1} = (\Delta x_i - \Delta\theta) + [\Delta\theta - \Delta x_i - (\alpha_i - \beta_i)F_i]^+$$

Suppose the first overflow interval in the BP ends at $\lambda_r$. Under the assumption $\Delta\theta - \Delta x_i - (\alpha_i - \beta_i)F_i \le 0$ for all $i$, it follows from (B.17)-(B.18) that $\Delta x_r = \Delta\theta$ and $\Delta L_r = -\Delta\theta$. Moreover, from Lemma 4, (B.24) gives $\Delta x_i = \Delta\theta$ for all $i = r + 1, \ldots, m$. Therefore, (B.53) gives $\Delta L_{i+1} = 0$ after every subsequent overflow interval and we get $\Lambda_k(\Delta\theta) = \Delta L_r = -\Delta\theta$. ■

Abstract

Simulation modeling of complex systems has received increasing research attention in recent years. In this paper, we discuss the basic concepts involved in multi-resolution simulation modeling of complex stochastic systems. We argue that, in many cases, using the average over all available high-resolution simulation results as the input to subsequent low-resolution modules is inappropriate and may lead to erroneous final results. Instead, high-resolution output data should be classified into groups that match underlying patterns or features of the system behavior before sending group averages to the low-resolution modules. We propose high-dimensional data clustering as a key interfacing component between simulation modules with different resolutions and use unsupervised learning schemes to recover the patterns for the high-resolution simulation results. We give some examples to demonstrate our proposed scheme.

Key  words:  Hierarchical  simulation,  multi-resolution  simulation,  clustering. 
AFOSR under contract F49620-98-1-0387 and by ERPI/DoD under contract WO8333-03.
C.1 Introduction


In modeling complex systems it is impossible to mimic every detail through simulation.
The common approach is to divide the whole system hierarchically into simpler modules,
each with a different simulation resolution. In this context, the output of one module becomes
an input parameter to another, as illustrated in Figure C.1. The decomposed modules
can be high-resolution or low-resolution models. High-resolution modules, e.g., the usual
discrete-event simulation models, take detailed account of all possible events, but are generally
time consuming. Low-resolution (or coarser) modules perform an aggregate evaluation of the
module’s functionality (i.e., determine what would happen “on the average”). Such modules
are less time consuming and can be any of the following components: differential equations
(used for example in combat [76] and semiconductor simulations [54]), standard
discrete-event simulation, and fluid simulation [86]. Furthermore, the decomposed modules can also
be an optimization or decision support tool such as the one described by Griggs et al. [37].


Figure C.1: Decomposition of complex systems

In a hierarchical setting, the lower level simulator (typically a high-resolution model)
generates output data which are then taken as input for the higher level simulator (typically
a low-resolution model). Hierarchical simulation is a common practice, but the design of
the hierarchy is always ad hoc. A popular practice is to use the mean values of variables from the
lower level output as the input to the higher level. This implies that significant statistical
information (i.e., statistical fidelity) is lost in the process, potentially resulting in completely
inaccurate results. Especially when the ultimate output of the simulation process is of the
form 0 or 1 (e.g., “lose” or “win” a combat), such errors can produce the exact opposite of
the real output.

A systematic design and analysis framework is definitely needed. In this paper, we
present some fundamental components of such a framework. Our effort has been directed
at developing an interface between two simulation levels that preserves statistical fidelity to
the maximum extent that available computing power allows. Our research focus is to
use high-dimensional clustering techniques to group the high-resolution sample paths into
meaningful clusters and pass them on to the lower resolution module(s). In the following we
explain this approach in more detail.

Quite often, the system being simulated is such that the high-resolution model produces
such widely divergent outputs that it does not make sense to summarize them through
a single average over the entire sample space. For example, when simulating a combat,
it does not make any sense to take the average of the output under different weather
conditions. In such cases, we must subdivide the sample space into segments, and get
the high-resolution model to produce an appropriate input to the low-resolution model
for each such segment. Essentially, the low-resolution model will be broken down into a
number of distinct components, one for each segment of the sample space. To carry out
such  a  segmentation,  the  high-resolution  paths  first  need  to  be  grouped  by  their  common 
features.  These  features  then  determine  and  feed  the  corresponding  low-resolution  model. 
The  practice  of  classifying  objects  according  to  perceived  similarities  is  the  basis  for  much 
of  science  and  engineering,  since  organizing  data  into  sensible  groupings  is  one  of  the  most 
fundamental  modes  of  understanding  and  learning.  Clustering  methods  have  been  widely 
applied  in  pattern  recognition,  image  processing,  and  artificial  intelligence.  In  this  paper, 
we  deal  with  clustering  methods  for  the  preservation  of  statistics  in  hierarchical  simulation. 

In  the  following  we  describe  the  idea  of  using  the  Adaptive  Resonance  Theory  (ART) 
neural  network  for  this  purpose.  ART  neural  networks  were  developed  by  Carpenter  and 
Grossberg  [10]  to  understand  the  clustering  function  of  the  human  visual  system.  They  are 
based  on  a  competitive  learning  scheme  and  are  designed  to  deal  with  the  stability/plasticity 
dilemma  in  clustering  and  general  learning.  It  is  clear  that  too  much  stability  would  lead 
to  a  “stubborn”  mind,  while  too  much  plasticity  would  lead  to  unstable  learning.  ART 
neural  networks  successfully  resolve  this  dilemma  by  matching  the  input  pattern  with  the 
prototypes.  If  the  matching  is  not  adequate,  a  new  prototype  is  created.  In  this  way, 
previously  learned  memories  are  not  eroded  by  new  learning.  In  addition,  the  ART  neural 
network  implements  a  feedback  mechanism  during  learning  to  enhance  stability. 

Our experiments using ART neural networks with combat simulation paths have been
quite successful [14]. We believe further improvement of the ART structure can lead to
a fundamental breakthrough in large data clustering, which is needed in complex systems
modeling. ART performs the clustering function based on the “angle” between the vectors
that describe the various input patterns. In some cases, as will be discussed in the sequel,
this  may  pose  a  limitation  of  ART  since  the  magnitude  of  an  input  pattern  may  contain 
significant  information  which  is  ignored.  To  alleviate  this  shortfall  we  develop  a  heuristic 
that  allows  the  magnitude  of  the  input  pattern  to  play  a  role  in  the  clustering  function. 
Furthermore,  we  are  developing  a  generic  numerical  clustering  tool,  based  on  the  ART 
neural  network,  that  can  be  used  for  many  important  problems  in  intelligent  data  analysis. 

In general, the description of a typical sample path generated by a discrete-event system
requires a large amount of data since such sample paths are typically quite long. This implies
that the dimension of each input pattern will also be large. However, for high-dimensional
data most clustering algorithms (including ART) involve a huge computational
effort; thus they are not practical for simulation modeling purposes. For this reason we
develop a new clustering approach where we try to take advantage of the statistical structure
behind a typical sample path. For high-dimensional complex systems we try to use a Hidden
Markov Model (HMM), which has been successfully used in speech recognition and other
areas [69], to characterize each observed sample path. In our approach, we use an HMM
to describe an arbitrary sample path and then cluster together all sample paths whose
corresponding HMMs have a high similarity measure (to be defined in Section C.6). The
advantage of this approach is that the amount of data required to describe an HMM is
generally much smaller than the amount of data required to explicitly describe an observed
sample path and, as a result, the HMM approach is more efficient.

The remainder of this paper is organized as follows. Section C.2 discusses some of the
issues that arise when interfacing between modules with different resolutions, while Section C.3
presents an example where statistical fidelity is lost as a result of poor interfacing between
high- and low-resolution models, i.e., passing only averages from high- to low-resolution
models. Section C.4 briefly presents the ART clustering algorithm and Section C.5 describes
an ART-based tool that we have developed and its application to a complex manufacturing
system. Section C.6 presents a new approach for clustering sample paths based on HMMs
and, finally, Section C.7 summarizes our work.


C.2  Interface  between  high-  and  low-resolution  models 


As mentioned above, the key issue in hierarchical simulation is the design of the interface
between the levels of the hierarchy. In a typical hierarchical simulation model, the lower level
consists of a high-resolution model, such as a discrete-event simulator, that generates several sample
paths given some input parameters $u$. The output of such simulation models is then used as
input to the higher level model (typically a low-resolution model). The question that arises,
and the focal point of this paper, is how much and what information we need to pass from the
high-resolution to the low-resolution model such that statistical fidelity is preserved.

Note that each sample path generated by the high-resolution model is also a function
of some randomness $\omega$ (a random number sequence generated through some random seed).
Thus, any function evaluated over an observed sample path (e.g., $h(u,\omega)$) is also a
random variable. Typically, we are not interested in the value of $h(u,\omega)$ obtained from a
single sample path but rather in the expectation $E\{h(u,\omega)\}$. Based on this, in hierarchical
simulation it is customary to use $E\{h(u,\omega)\}$ as an input parameter to the higher level
model, as seen in Figure C.2. This is often highly unsatisfactory, since the mean often
obscures important features of the high-resolution output. Put another way, we are
seeking $E\{L(h(u,\omega))\}$, where $L(\cdot)$ is a function corresponding to the low-resolution model,
but what we end up evaluating by passing a single average is $L(E\{h(u,\omega)\})$; in
general, however, $E\{L(h(u,\omega))\} \neq L(E\{h(u,\omega)\})$.
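The gap between the two quantities can be seen in a small numerical sketch. The Python snippet below is only an illustration: the two-valued high-resolution output `sample_high_res` and the threshold-type low-resolution model `low_res_model` (returning 1 for "win" and 0 for "lose") are hypothetical stand-ins, and none of the numbers are taken from this report.

```python
import random


def low_res_model(a, threshold=150.0):
    """Hypothetical low-resolution model L: 1.0 ("win") if the input is
    below the threshold, 0.0 ("lose") otherwise."""
    return 1.0 if a < threshold else 0.0


def sample_high_res(rng):
    """Hypothetical high-resolution output h(u, omega): centered at 120
    with probability 0.8 and at 200 with probability 0.2."""
    center = 120.0 if rng.random() < 0.8 else 200.0
    return rng.gauss(center, 5.0)


def compare(n_samples=20000, seed=1):
    """Return (L(E{h}), E{L(h)}) estimated from n_samples sample paths."""
    rng = random.Random(seed)
    samples = [sample_high_res(rng) for _ in range(n_samples)]
    mean_h = sum(samples) / n_samples
    l_of_mean = low_res_model(mean_h)  # pass a single average to L
    mean_of_l = sum(low_res_model(h) for h in samples) / n_samples
    return l_of_mean, mean_of_l


l_of_mean, mean_of_l = compare()
print(l_of_mean, mean_of_l)
```

Under these assumptions the overall mean (about 136) lies below the threshold, so passing the single average predicts a certain "win", while averaging the low-resolution output over the samples gives a value near 0.8.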

Figure C.2: Hierarchical model interface: passing a simple average to the lower resolution
model

To solve this problem we propose the use of clustering to identify groups of sample paths
that have some “common features” and that, therefore, when averaged together do not cause
the loss of too much information. This approach is shown in Figure C.3. From the $N$
observed sample paths we identify $m < N$ groups that share some common features and
determine $m$ input parameters $a_1, \ldots, a_m$, where $a_i = E\{h^i(u,\omega)\}$ and $h^i(\cdot)$ identifies all
sample paths in cluster $i$. Subsequently, each parameter $a_i$ is used as an input to a lower
resolution model and finally we obtain $E\{L(a_i)\}$ over $m$ low-resolution components, which
we claim is a better estimate of the overall system output than the one obtained using a
single average.


Figure  C.3:  Hierarchical  model  interface:  passing  several  averages  to  the  lower  resolution 
model,  one  for  each  cluster 


One may pose the following question: Since the desired output is of the form $E\{L(h(u,\omega))\}$,
why bother with clustering at all when we can evaluate $L(h(u,\omega))$ for all $N$ obtained
samples and then perform the required expectation, especially since the low-resolution models
are generally easy to evaluate? The answer to this question lies in the derivation of the
low-resolution model. Typically, $L(a)$ assumes that $a$ is an expectation and therefore it
would be meaningless to use some quantity obtained from a single sample path.




C.3  Losing  statistical  fidelity:  an  example 


In this section we present a simple example where the loss of statistical fidelity can result in
poor use of resources with possibly catastrophic consequences. For this example, we assume
that we are interested in planning a mission that consists of several operations $O_1, \ldots, O_n$
as shown in Figure C.4. Due to the dependence between operations (e.g., $O_4$ cannot start
until $O_2$ and $O_3$ are completed), it is necessary to know the time requirements of each
operation. So it is natural to ask the following question: How much time should be allocated
for each operation so that the probability of not completing an operation within the allocated
time (threshold/deadline) is less than $P_0$?


Figure  C.4:  Operation  Scheduling 


The methodology for attacking this problem follows the hierarchical structure described
next. For every individual operation there exists a high-resolution model that simulates the
operation under different scenarios and returns the average $m$ and standard deviation $\sigma$ of
the time that it takes to complete it. Subsequently, the planners use a low-resolution model
to determine the time to allocate to the operation so that the probability of not meeting the
deadline is less than $P_0$. One such low-resolution model is the Normal distribution, and the
probability of not meeting the scheduled deadline is obtained from the corresponding cumulative
distribution function. Hence, the problem is to find the threshold time $T$ such that


\[ F(T) = 1 - \int_{-\infty}^{T} \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{1}{2}\left(\frac{x-m}{\sigma}\right)^2} dx < P_0 \qquad \text{(C.1)} \]
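As a worked step, condition (C.1) can be rewritten in terms of the standard Normal distribution function $\Phi$ (a symbol introduced here for illustration only), which gives the boundary value of the threshold directly:

```latex
\[
1 - \Phi\!\left(\frac{T - m}{\sigma}\right) < P_0
\quad\Longleftrightarrow\quad
T > m + \sigma\,\Phi^{-1}(1 - P_0).
\]
```

For instance, with $P_0 = 0.1$ we have $\Phi^{-1}(0.9) \approx 1.2816$, so the deadline must exceed the mean by roughly $1.28$ standard deviations.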


A typical operation in a mission involves the Aircraft Maintenance and Refueling System
(ARMS) shown in Figure C.5. In this system there are $Q$ classes $\{C_1, \ldots, C_Q\}$ of aircraft
that can be refueled/maintained in one of $B$ bases $\{B_1, \ldots, B_B\}$. Before an aircraft receives
any service, it needs to travel to the corresponding base for a time $\tau$ which depends on the
initial location of the aircraft with respect to the base. Once an aircraft (say aircraft $a$)
arrives at the base, it is assigned one of the $M$ priorities $\{P_1, \ldots, P_M\}$ and is placed
in the corresponding queue. If a token is available at the token queue, it is assigned to $a$,
which then proceeds to the next FIFO queue. Once all preceding aircraft are serviced, $a$
enters the server, where it stays for a random period of time with mean $1/\mu_{bc}$, which depends
on the base $b$ and the aircraft class $c$. For a given mission, it is required to service $A$
aircraft before the next operation can start. Therefore, one can use a simulation model
to determine the time it takes to process the $A$ aircraft. Note that this processing time
depends on several parameters: (a) the number of bases and the routing algorithm that
determines which aircraft goes to which base, (b) the initial location of each aircraft, and (c)
the service times of each class of aircraft at each base. These parameters are considered as
the initial conditions of each simulation (high-resolution model) and the output is going to
be used as an input to the low-resolution model (i.e., the Normal distribution).


Figure  C.5:  ARMS  Model 


C.3.1  Simulation  Results 

To test the clustering ideas we use the following simple problem. We assume that this
operation involves $A = 100$ aircraft that can be classified in $Q = 3$ classes. An aircraft is
classified as $C_1$ with probability 0.25, as $C_2$ with probability 0.25 and as $C_3$ with probability
0.5. Each aircraft is routed probabilistically to one of two identical bases ($B = 2$). Each
base has a total of 3 tokens and each class $C_1$, $C_2$, $C_3$ requires service for a random period
of time which is Erlang distributed with means 1.2, 1.8, 2.4 time units, respectively. The
routing to each base depends on the initial “state of the world” $W$ (e.g., current weather
conditions). We assume that $W$ can only take two values, 0 and 1, with probabilities 0.8
and 0.2, respectively. If $W = 0$, then the aircraft are routed to either Base 1 or Base 2 with
equal probability. On the other hand, when $W = 1$, all aircraft are routed to Base 1.

We simulated the ARMS model 65,000 times and obtained the histogram shown in
Figure C.6 for the time it takes to process all 100 aircraft. One observation is that the
average time to process all 100 aircraft is about 135 time units; however, the probability of
actually processing all aircraft in 135 time units is very small. From the simulation results we
also obtain an estimate of the standard deviation, which was found to be 33.9 time units.


Figure  C.6:  Completion  time  of  the  first  A  =  100  aircraft 

For planning purposes, it is now required to find the smallest deadline such that the
probability of missing it is less than $P_0 = 10\%$. Using the Normal distribution with $m = 135$
and $\sigma = 33.9$ we find from (C.1) that the deadline should be set at 178.4 time units.
However, using all simulation results it is found that in order to meet the deadline with
probability equal to 90% it is necessary to set it at 200 time units. Therefore, use of the
simple averages has resulted in an error of about 11%. What is more dramatic is the
error in the probability of missing the deadline. If we use the deadline suggested by the
previous procedure on the data obtained through simulation, it is observed that the actual
probability of missing the deadline is not 10% as was originally desired but about 19.7%;
therefore there is an error of about 98%. What is also interesting is that this error was
obtained even though we passed both first and second order statistics to the low-resolution
model. One might expect the errors to be even larger in cases where only first order statistics
are passed to the low-resolution model.

To  solve  this  problem,  we  can  use  a  simple  clustering  approach.  Rather  than  using  a 
single  average  and  standard  deviation,  we  can  form  groups  of  data,  determine  the  average 
and  standard  deviation  of  each  group  and  use  those  estimates  to  drive  the  low-resolution 
model.  For  our  simulation  example  it  is  natural  to  cluster  the  obtained  data  based  on  the 
initial  conditions  of  each  simulation.  Therefore,  all  data  that  were  obtained  when  W  =  0 
form  cluster  0  and  data  that  were  obtained  when  W  =  1  form  cluster  1.  Using  this  grouping, 
we  obtain  the  following  results 


             Cluster 0    Cluster 1
Samples       52,014       12,986
$m$            118.5        200.3
$\sigma$         8.7         10.4
Using these results, we form the weighted sum of two Normal distributions

\[ f(x) = \frac{52{,}014}{65{,}000} \, N(m_0, \sigma_0) + \frac{12{,}986}{65{,}000} \, N(m_1, \sigma_1) \qquad \text{(C.2)} \]


Finally, using trial and error we find that, in order to keep the probability of missing the
deadline at 10%, it is necessary to set the deadline at 200 time units, which is in agreement
with the simulation results.
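The comparison between the single Normal model and the two-cluster mixture can be reproduced numerically from the statistics quoted above. The following Python sketch uses only the reported means, standard deviations and sample counts; the function names and the bisection search (used here in place of trial and error) are illustrative choices, not part of the original study.

```python
from statistics import NormalDist

P0 = 0.10  # allowed probability of missing the deadline


def single_normal_deadline(m, sigma, p_miss=P0):
    """Deadline T with P(X > T) = p_miss when X ~ Normal(m, sigma)."""
    return NormalDist(m, sigma).inv_cdf(1.0 - p_miss)


def mixture_miss_probability(t, components):
    """P(X > t) for a weighted sum of Normal distributions.
    components is a list of (weight, mean, std) tuples."""
    return sum(w * (1.0 - NormalDist(m, s).cdf(t)) for w, m, s in components)


def mixture_deadline(components, p_miss=P0, lo=0.0, hi=1000.0, tol=1e-6):
    """Bisection on t; the miss probability decreases as t grows."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mixture_miss_probability(mid, components) > p_miss:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# (weight, mean, std) for cluster 0 (W = 0) and cluster 1 (W = 1)
clusters = [
    (52014 / 65000, 118.5, 8.7),
    (12986 / 65000, 200.3, 10.4),
]

t_single = single_normal_deadline(135.0, 33.9)
t_mixture = mixture_deadline(clusters)
miss_at_single = mixture_miss_probability(t_single, clusters)
print(round(t_single, 1), round(t_mixture, 1), round(miss_at_single, 3))
```

Under the mixture model, the single-Normal deadline of about 178.4 time units is missed with probability close to 0.2, while the mixture itself places the 10% deadline at roughly 200 time units, in line with the figures reported from the simulation data.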

If clustering is used as an approach to preserve statistical fidelity, we need a systematic
way of grouping sample paths into clusters, especially when there is no apparent way
to classify sample paths like the “world state” $W$ we used in the above example. Such
approaches are presented in the next two sections.


C.4  Clustering  using  Adaptive  Resonance  Theory  (ART) 


Our  work  is  motivated  by  our  earlier  path  bundle  grouping  approach  in  hierarchical  combat 
simulation.  In  dealing  with  hierarchical  simulation  models  one  needs  to  consider  grouping 
sample  paths  generated  from  high-resolution  models  so  as  to  provide  appropriate  input 
statistics  to  the  lower  resolution  model.  This  requires  clustering  very  high  dimensional  data 
vectors  (the  sample  paths  from  high-resolution  simulators).  We  have  used  this  approach  in 
the  Concept  Evaluation  Model  (CEM)  of  the  Concept  Analysis  Agency  in  order  to  group  the 
sample  paths  from  the  high-resolution  Combat  Sample  Generator  (COSAGE)  and  generate 
the input to the lower resolution Attrition Calculation (ATCAL). Concrete numerical results
are reported in Guo et al. [39].

The algorithm used for clustering in the ART framework is closely related to the
well-known $k$-means clustering algorithm. Both use single prototypes to internally represent and
dynamically adapt clusters. The $k$-means algorithm clusters a given set of input patterns
into $k$ groups. The parameter $k$ thus specifies the coarseness of the partition. In contrast,
ART uses a minimum required similarity between patterns that are grouped within one
cluster. This similarity parameter is called vigilance and is denoted by $\rho$. The resulting
number of clusters depends on the distances (in terms of the applied metric) between all
input patterns presented to the network during training cycles.

The basic ART architecture consists of the input layer $F_1$, the output/cluster layer $F_2$
and the reset mechanism which controls the degree of similarity required among patterns
that are placed in the same cluster. $F_1$ has $n$ input units, thus each input pattern must have
dimension $n$, while the output layer consists of $m$ units, therefore the maximum number of
clusters that can be generated by the network is $m$. Note that $F_2$ is a competitive layer,
in other words only the unit with the largest input has an activation other than zero, and
therefore each unit corresponds to a different cluster. Furthermore, the input layer $F_1$ is
subdivided into two sub-layers: $F_1(a)$, the input portion, and $F_1(b)$, the interface portion, which
receives input from both the input portion $F_1(a)$ and the output layer $F_2$.




In the ART algorithm, every input pattern, after some preprocessing, is compared to
each of the $m$ prototypes which are stored in the network’s weights. If the degree of
similarity between the current input pattern $I$ and the best fitting prototype $J$ is at least as
high as a given vigilance $\rho$, prototype $J$ is chosen to represent the cluster containing $I$. The
vigilance $\rho$ defines the minimum similarity between an input pattern and the prototype of
the cluster it is associated with and is typically limited to the range $[0, 1]$. If the similarity
between $I$ and the best fitting prototype $J$ does not fit into the vigilance interval $[\rho, 1]$, then
a new cluster has to be installed, where the current input is most commonly used as the
first prototype or “cluster center”. Otherwise, if one of the previously committed clusters
matches $I$ well enough, its prototype is adapted by being slightly shifted towards the values
of the input pattern $I$. For a detailed description and analysis of the ART algorithm refer
to Carpenter and Grossberg [10] and Fausett [27].


Often, the level of detail of the input patterns may be different; some input patterns may
have fewer than $n$ non-zero components. In ART this has motivated the use of a similarity
measure that is independent of the magnitude of the input pattern, and so input patterns
are first normalized before being presented to the neural network. The similarity measure
is instead based on the angle between the vector that corresponds to an input pattern and the
cluster’s prototype vector, as illustrated in Figure C.7 for a two-dimensional input vector
($n = 2$). This figure shows the $j$th cluster prototype vector $W_j$ and the range of the cluster,
which extends $\beta$ radians above and below the angle of $W_j$. Note that $\beta$ is a function of the
vigilance parameter $\rho$; the larger the value of $\rho$, the higher the similarity required among the
vectors that are placed in the same cluster and thus the smaller the angle $\beta$. For this
example, the angle between the input pattern $I$ and $W_j$ is $\phi > \beta$; therefore $I$ will not be
assigned to cluster $j$.


Figure  C.7:  Similarity  in  ART  is  measured  by  the  angle 
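The vigilance test and prototype update described in this section can be sketched in a few lines of Python. This is a deliberately simplified, single-pass illustration and not the full ART network: it assumes that the cosine of the angle between normalized vectors stands in for the ART match function, and the names `vigilance_clustering`, `learning_rate` and `max_clusters` are hypothetical.

```python
import math


def normalize(v):
    """Scale v to unit length so that only its direction matters."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("zero vector has no direction")
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine of the angle between two unit vectors."""
    return sum(x * y for x, y in zip(a, b))


def vigilance_clustering(patterns, rho, learning_rate=0.1, max_clusters=None):
    """Assign each pattern to the best matching prototype if the match is
    at least rho; otherwise start a new cluster with the pattern itself."""
    prototypes, labels = [], []
    for pattern in patterns:
        unit = normalize(pattern)
        best, best_sim = None, -1.0
        for j, proto in enumerate(prototypes):
            sim = cosine(unit, proto)
            if sim > best_sim:
                best, best_sim = j, sim
        if best is not None and best_sim >= rho:
            # Shift the winning prototype slightly toward the input.
            moved = [(1.0 - learning_rate) * p + learning_rate * x
                     for p, x in zip(prototypes[best], unit)]
            prototypes[best] = normalize(moved)
            labels.append(best)
        elif max_clusters is None or len(prototypes) < max_clusters:
            prototypes.append(unit)  # input becomes the first prototype
            labels.append(len(prototypes) - 1)
        else:
            labels.append(None)  # all m cluster units are already in use
    return labels, prototypes


data = [[1.0, 0.0], [2.0, 0.1], [0.0, 1.0], [0.1, 3.0], [5.0, 5.0]]
labels, prototypes = vigilance_clustering(data, rho=0.95)
print(labels)
```

Raising `rho` narrows the admissible angle around each prototype and therefore tends to produce more clusters. Because every input is normalized first, two vectors with the same direction but very different magnitudes always fall into the same cluster in this sketch.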
C.5 An application to a “real-world” complex system


As mentioned earlier, we are also developing a generic clustering tool, the Intelligent
Clustering Interface (ICI), and we encountered an interesting opportunity to test it in the case of
a complex manufacturing system. In particular, in working with a large metal manufacturer
we were faced with the issue of supplying a low-resolution model of a large plant with the
necessary parameters for running it. These parameters are to be obtained from detailed
(high-resolution) models of the process plans (or flowpaths) for over 10,000 products
manufactured in the plant. A flowpath is a specific sequence of Production Centers (PCs) with
different processing characteristics at each PC (there are over 100 such PCs). Thus, each
flowpath may be thought of as corresponding to a unique product; however, since the
low-resolution model cannot possibly handle input data for over 10,000 flowpaths, the objective
is to group products with similar flowpaths. For purposes such as forecasting, capacity
planning, and lead-time estimation (among others) it is in fact indispensable to have such
product groups available: not only is it conceptually infeasible to work with over 10,000
distinct products, it is also practically impossible to input such high-dimensional data for
over 10,000 products and 100 PCs into modeling and decision support tools. Moreover,
even if there were an automated way to accomplish this, it would be unrealistic to expect
anyone to manipulate or interpret output data with information such as inventory levels
and lead times for many thousands of distinct products.

In the effort to establish groups (or clusters) of products based on similarities in
flowpaths and processing characteristics, an initial project was set up with plant experts given
the task of “manually” creating such groupings. The project was quickly abandoned: in
addition to the sheer product volume, which makes this task prohibitive, it is also difficult to
rationally quantify “similarities” in flowpaths and processing data without some systematic
means of doing so. We were able to accomplish this task using the clustering techniques we
have developed and obtained a “compression” of over 10,000 products to 25-100 product
clusters (depending on the aggregation accuracy required, which is completely controlled
by the analyst). Of particular interest is the fact that the plant experts who reviewed the
results we obtained found, in hindsight, the clusters defined by our method to be consistent with
their expectations.

Unlike other applications of ART, in this case the magnitude of the vector corresponding
to an input pattern contains important information that should be used when grouping
products into clusters. For example, the input vectors corresponding to two products may
have the same orientation (same flowpath) but differ in their magnitude (the times spent at
the PCs differ by orders of magnitude). For this reason we developed an enhancement to the
ART algorithm that allows us to include magnitude information in the clustering process,
as described next.


C.5.1  ART  Enhancements 

As already pointed out, the basic mechanism through which the ART neural network
performs clustering is by grouping input patterns using an “angle” criterion. This is illustrated in
Figure C.8, where we used the ICI tool to cluster 200 two-dimensional vectors. For a vigilance
parameter value of $\rho = 0.99$, three clusters were obtained, as seen in the figure.


Figure C.8: 200 2-dim. vectors, no extra dimension, $\rho = 0.99$

It is reasonable to expect, however, that data vectors with almost identical orientation
but significantly different magnitudes (prior to normalization) should be distinguishable. In
order to introduce this capability into the ART setting, we have introduced the following
enhancement: Each data vector is provided with an additional component (thus enlarging
its dimensionality from $n$ to $n+1$) which is the Euclidean norm of the $n$-dimensional vector
$(x_1, \ldots, x_n)$, i.e., $(x_1^2 + \cdots + x_n^2)^{1/2}$. This forces the ART neural network to include magnitude
information in its clustering algorithm. This is clearly seen in Figure C.9, where the same
data vectors as in Figure C.8 have been clustered with this additional component. There
are still three clusters, but one can see that the contents and features of the clusters have
changed.


Figure C.9: Same 200 2-dim. vectors, WITH extra dimension, $\rho = 0.99$

An  alternative  is  to  apply  the  same  idea  by  individually  reclustering  each  of  the  clusters 
originally  obtained.  Thus,  within  each  cluster  we  may  now  distinguish  between  data  vectors 
in  terms  of  their  magnitude,  as  illustrated  in  Figure  C.10  where  10  clusters  are  now  obtained 
by  reclustering  each  of  the  three  clusters  in  Figure  C.8  using  a  magnitude  component. 

Figure C.10: Same 200 2-dim. vectors, WITH extra dim. reclustered within each of the 3
original clusters, p = 0.99

Finally, note that within the ICI tool there is the capability of providing a “weight” to
each component of the n-dimensional data vectors. Thus, by controlling the value of this
weight for the extra component added, we can adjust the importance to be attributed to
the magnitude of a vector as opposed to just its orientation in n dimensions.
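
As a minimal sketch of the magnitude augmentation described above (not the ICI tool itself), the following Python function appends the Euclidean norm to each data vector before clustering. The name `augment_with_norm` and the treatment of the weight as a plain scale factor on the extra component are assumptions made for illustration; the ICI tool may apply its component weights differently.

```python
import numpy as np

def augment_with_norm(vectors, weight=1.0):
    """Append the (optionally weighted) Euclidean norm as an extra component.

    vectors: array-like of shape (num_vectors, n)
    weight:  hypothetical scale factor controlling how much the magnitude
             matters relative to the orientation
    returns: array of shape (num_vectors, n + 1)
    """
    x = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)  # (x_1^2 + ... + x_n^2)^(1/2)
    return np.hstack([x, weight * norms])

# Two vectors with the same orientation but different magnitudes
# receive different values in the added component.
augmented = augment_with_norm([[3.0, 4.0], [0.3, 0.4]])
```

Setting `weight` to zero removes the magnitude information again, while larger values make the extra component dominate the comparison.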


C.6  Using Hidden Markov Models for Sample Path Clustering


A sample path generated by a discrete-event system consists of a sequence $\{(e_k, t_k)\}$,
$k = 1, 2, \ldots, K$, where $(e_k, t_k)$ denotes the kth event and its corresponding occurrence time.
For typical systems, the number of observed events K is very large, and thus clustering
sample paths “directly” by making explicit use of the entire event sequence $\{(e_k, t_k)\}$ requires
that the input vectors corresponding to such sample paths be of very large
dimension. This imposes a significant computational burden on any clustering algorithm,
including ART. To solve this problem, we try to take advantage of the structure of discrete-event
sample paths. In our experiments, we observe a sample path that is generated by
an arbitrary system and try to describe it by some Markov chain; we then use the theory
of Hidden Markov Models (HMMs) [69] to identify its parameters. Once we identify the
HMM parameters, we define a similarity measure between the HMMs obtained and cluster
together the sample paths whose HMMs are most similar. The advantage of this approach is
that the amount of information required to describe an HMM is considerably less than the
amount of information required to describe a sample path. Even though the identification
of the HMM parameters requires some additional computational overhead, our experiments
have shown that, overall, the HMM approach is considerably faster than direct clustering
approaches. Incidentally, we point out that this approach makes no a priori assumptions
about the statistical distribution of the data to be analyzed.

Next, we demonstrate the HMM clustering approach through an example. For the
purposes of our example, we assume that we have three systems $S_1$, $S_2$ and $S_3$. When
simulated, each system generates sample paths $Q_{ij}$, $i = 1, 2, 3$, $j = 1, 2, \ldots$, where $Q_{ij}$
corresponds to the jth sample path generated by system $S_i$. When clustering sample paths,
it would be reasonable to expect that sample paths generated from the same system are
grouped in the same cluster. In this example, we generate 9 sample paths, 3 from each
system, and develop a way to distinguish between sample paths obtained from different
systems. To achieve this, we first associate an HMM $\lambda_i = (A_i, B_i, \pi_i)$ with each sample path
$i = 1, \ldots, 9$, where we use the notation of Rabiner [69]: $A_i$ denotes the state transition
probability, $B_i$ denotes the observation symbol probability at every state and $\pi_i$ denotes
the initial state distribution. To complete the definition of the HMM, we assume that it
consists of N states and that at every state we can observe any of the M possible symbols.

To construct the three systems $S_i$, $i = 1, 2, 3$, we assume that each consists of a Markov
chain with $N_i$ states ($N_1 = 20$, $N_2 = 10$ and $N_3 = 10$) and randomly generate a state transition
probability matrix $P_i = [p^i_{kl}]$, $k, l = 1, 2, \ldots, N_i$. Furthermore, we randomly generate
the parameters $\mu^i_k$, $k = 1, 2, \ldots, N_i$, so that the exponentially distributed sojourn time for
state k has mean $1/\mu^i_k$. However, not all real systems are memoryless; therefore, to make
our example more interesting we introduce some “special states” where state transitions out
of such states are not made according to the state transition probability but rather through
the set of rules we describe next.

For $S_1$, we assume that states $1, \ldots, 5$ are special states and state transitions out of
these states are made according to the following rules:

State 1: System stays at state 1 for 3 consecutive, exponentially distributed sojourn time
intervals, each with mean $1/\mu^1_1$. Then it jumps to state 10.

State 2: System stays at state 2 for a deterministic sojourn time interval of length $2/\mu^1_2$. Then
it jumps to state n according to the state visited before arriving at state 2, denoted
by $s_{-1}$:

$$n = \begin{cases} 15 & \text{if } s_{-1} \in \{1, \ldots, 10\}, \\ 4 & \text{if } s_{-1} \in \{10, \ldots, 20\}. \end{cases}$$

State 3: System stays at state 3 for 5 consecutive sojourn time intervals, each exponentially
distributed with mean $1/\mu^1_3$. Then it transfers according to state transition probability
matrix $P_1$. Note that after the system has spent 5 sojourn intervals at state 3, it may
return to state 3 with probability $p^1_{33}$ for 5 more intervals.

State 4: System stays at state 4 for 1 exponentially distributed sojourn time interval with
mean $1/\mu^1_4$. Then it jumps to state n according to the state visited before arriving at
state 4:

$$n = \begin{cases} 6 & \text{if } s_{-1} \in \{1, \ldots, 5\}, \\ 11 & \text{if } s_{-1} \in \{6, \ldots, 10\}, \\ 16 & \text{if } s_{-1} \in \{11, \ldots, 15\}, \\ 1 & \text{if } s_{-1} \in \{16, \ldots, 20\}. \end{cases}$$

State 5: System stays at state 5 for a deterministic sojourn interval of length $1/\mu^1_5$, and
then transfers according to the state transition probability matrix $P_1$.


Both $S_2$ and $S_3$ have only 2 special states, states 1 and 2, and the rules for transitions out of these
states are the following:

State 1: System stays at state 1 for 2 sojourn time intervals, each of which is exponentially
distributed with mean $1/\mu^i_1$, $i = 2, 3$. Then it transfers to state n according to the
previous state visited:

$$n = \begin{cases} 7 & \text{if } s_{-1} \in \{1, 2\}, \\ 9 & \text{if } s_{-1} \in \{3\}, \\ 5 & \text{if } s_{-1} \in \{4, \ldots, 10\}. \end{cases}$$

State 2: System stays at state 2 for a deterministic amount of time equal to $1/\mu^i_2$, $i = 2, 3$,
and then transfers to state 5.
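
The rules for the 10-state systems can be sketched in Python as follows. This is only an illustration under stated assumptions, not the simulator used for the reported experiments: the function names, the start state, the convention that the "previous state" initially equals the start state, and the choice to record each sojourn interval as a separate holding-time observation are all hypothetical. The range for the mean sojourn times (4 to 50) is the one quoted for the experiment in this section.

```python
import numpy as np

def make_system(n_states, rng):
    """Random transition matrix P and mean sojourn times 1/mu_k (assumed in [4, 50])."""
    P = rng.random((n_states, n_states))
    P /= P.sum(axis=1, keepdims=True)          # rows sum to one
    mean_sojourn = rng.uniform(4.0, 50.0, size=n_states)
    return P, mean_sojourn

def next_state_after_special_1(prev):
    """Rule for special state 1: destination depends on the previously visited state."""
    if prev in (1, 2):
        return 7
    if prev == 3:
        return 9
    return 5                                    # prev in {4, ..., 10}

def simulate(P, mean_sojourn, n_events, rng, start=3):
    """Return n_events holding times; states are numbered 1..N as in the text."""
    n = len(mean_sojourn)
    state, prev = start, start                  # assumed initial condition
    holding = []
    while len(holding) < n_events:
        m = mean_sojourn[state - 1]
        if state == 1:                          # two exponential intervals, then rule-based jump
            holding.extend(rng.exponential(m, size=2))
            nxt = next_state_after_special_1(prev)
        elif state == 2:                        # deterministic interval, then state 5
            holding.append(m)
            nxt = 5
        else:                                   # ordinary memoryless state
            holding.append(rng.exponential(m))
            nxt = int(rng.choice(n, p=P[state - 1])) + 1
        prev, state = state, nxt
    return np.array(holding[:n_events])

rng = np.random.default_rng(0)
P, mean_sojourn = make_system(10, rng)
holding_times = simulate(P, mean_sojourn, n_events=1000, rng=rng)
```

Only the holding times are returned, mirroring the assumption used later that the visited states themselves are not observable.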

For each of the 9 sample paths generated by $S_1$, $S_2$, and $S_3$, we adjust the HMM
parameters $\lambda_j$, $j = 1, \ldots, 9$, to maximize the probability that the jth observed sample
path was obtained from $\lambda_j$. This is referred to as the training problem and is tackled by
repeatedly solving what is described as “Problem 3” in Rabiner [69]. For the purposes of
this experiment, we assume that each HMM consists of N = 6 states. In addition, we
assume that the actual state visited by each of the systems is not observable. Rather, the
observation symbols at each state are the state holding times. These can generally take any
positive value, so to determine $B_i$, the observation symbol probability, we quantize all possible
values into M = 64 intervals.
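
The quantization step can be illustrated with a short Python sketch. The text does not say how the 64 intervals are chosen, so the equal-width bins on [0, upper] used here, the function name `quantize`, and the clamping of larger values into the last bin are assumptions.

```python
import numpy as np

def quantize(holding_times, n_symbols=64, upper=None):
    """Map positive holding times to symbols 0 .. n_symbols-1.

    Assumes equal-width intervals on [0, upper]; values at or above
    `upper` are assigned to the last interval.
    """
    x = np.asarray(holding_times, dtype=float)
    if upper is None:
        upper = x.max()
    edges = np.linspace(0.0, upper, n_symbols + 1)
    # Only the interior edges are needed to decide the bin index.
    return np.digitize(x, edges[1:-1])

symbols = quantize([0.0, 1.0, 63.9, 64.0, 100.0], n_symbols=64, upper=64.0)
```

Other choices, such as bins with equal empirical probability, would serve the same purpose of turning continuous holding times into a finite observation alphabet.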

Once we determine the HMM parameters for all sample paths, $\lambda_i$, $i = 1, \ldots, 9$, we use
the similarity measure also defined in Rabiner [69] to determine which HMMs, and consequently
which sample paths, are sufficiently similar so that they can be clustered together.
The similarity measure is defined for any pair of HMMs $\lambda_i$ and $\lambda_j$ as:

$$\sigma(\lambda_i, \lambda_j) = \exp\{D(\lambda_i, \lambda_j)\} \qquad \text{(C.3)}$$

where

$$D(\lambda_i, \lambda_j) = \frac{\log\Pr(O^j \mid \lambda_i) + \log\Pr(O^i \mid \lambda_j) - \log\Pr(O^i \mid \lambda_i) - \log\Pr(O^j \mid \lambda_j)}{2TK} \qquad \text{(C.4)}$$

is what Rabiner [69] called the distance measure. $\Pr(O^i \mid \lambda_j)$ is the probability that the observation
sequence $O^i$, i.e., the sequence of state holding times that corresponds to the sample
path $Q_i$, was generated by HMM $\lambda_j$. For computational convenience, we break each sample
path into K smaller sample paths of length T and thus compute

$$\log\Pr(O^i \mid \lambda_j) = \sum_{k=1}^{K} \log\Pr(O^i_k \mid \lambda_j)$$

where $\Pr(O^i_k \mid \lambda_j)$ is the probability that the kth sequence of sample path i was generated by
HMM $\lambda_j$. Also, note that the similarity measure is symmetric, that is, $\sigma(\lambda_i, \lambda_j) = \sigma(\lambda_j, \lambda_i)$,
a desired property for a good similarity measure.
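
A minimal Python sketch of this computation is given below, assuming discrete observation symbols and already-trained models. The scaled forward algorithm is the standard way to evaluate the log-likelihood of a discrete HMM; the function names and the representation of a model as a tuple (A, B, pi) are hypothetical, and training (Rabiner's "Problem 3") is not shown.

```python
import numpy as np

def log_likelihood(obs, A, B, pi):
    """log Pr(obs | (A, B, pi)) via the scaled forward algorithm."""
    A, B, pi = np.asarray(A, float), np.asarray(B, float), np.asarray(pi, float)
    alpha = pi * B[:, obs[0]]
    c = alpha.sum()
    loglik = np.log(c)
    alpha = alpha / c
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]           # propagate, then weight by emission
        c = alpha.sum()
        loglik += np.log(c)                     # accumulate log of scaling factors
        alpha = alpha / c
    return loglik

def segmented_log_likelihood(obs, model, T):
    """Sum of log-likelihoods over K = len(obs) // T segments of length T."""
    K = len(obs) // T
    return sum(log_likelihood(obs[k * T:(k + 1) * T], *model) for k in range(K))

def similarity(obs_i, obs_j, model_i, model_j, T):
    """sigma = exp(D), with D the symmetric distance normalized by 2*T*K."""
    if len(obs_i) != len(obs_j):
        raise ValueError("this sketch assumes sequences of equal length")
    K = len(obs_i) // T
    D = (segmented_log_likelihood(obs_j, model_i, T)
         + segmented_log_likelihood(obs_i, model_j, T)
         - segmented_log_likelihood(obs_i, model_i, T)
         - segmented_log_likelihood(obs_j, model_j, T)) / (2.0 * T * K)
    return float(np.exp(D))
```

By construction the value is 1 when a model and its sequence are compared with themselves, and swapping the two arguments leaves the sum in the numerator unchanged, which is the symmetry property noted above.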

For the similarity results shown in Table C.1, the length of each of the 9 sample paths is
10,000 events. In addition, the parameters $\mu^i_j$, $i = 1, 2, 3$, $j = 1, \ldots, N_i$, are generated such
that $1/\mu^i_j$ are uniformly distributed between 4 and 50. Sample paths $Q_1$, $Q_5$ and $Q_6$ are
generated by $S_1$; $Q_2$, $Q_3$ and $Q_4$ are generated by $S_2$; and $Q_7$, $Q_8$ and $Q_9$ are generated
by $S_3$.

Table C.1: Similarity measure among the HMMs corresponding to each of the 9 sample
paths.

         HMM 1   HMM 2   HMM 3   HMM 4   HMM 5   HMM 6   HMM 7   HMM 8   HMM 9
HMM 1    1       0.760   0.769   0.776   0.950   0.950   0.798   0.804   0.794
HMM 2    0.760   1       0.940   0.949   0.772   0.776   0.837   0.839   0.835
HMM 3    0.769   0.940   1       0.947   0.777   0.785   0.850   0.847   0.847
HMM 4    0.776   0.949   0.947   1       0.787   0.793   0.847   0.844   0.844
HMM 5    0.950   0.772   0.777   0.787   1       0.951   0.799   0.804   0.799
HMM 6    0.950   0.776   0.785   0.793   0.951   1       0.815   0.820   0.809
HMM 7    0.798   0.837   0.850   0.847   0.799   0.815   1       0.945   0.943
HMM 8    0.804   0.839   0.847   0.844   0.804   0.820   0.945   1       0.950
HMM 9    0.794   0.835   0.847   0.844   0.799   0.809   0.943   0.950   1

Finally, we cluster together all sample paths that correspond to HMMs with similarity
measure greater than a threshold value V. Note that V corresponds to the required degree
of similarity for two sample paths to be clustered together, like the vigilance parameter p
of ART. For example, if V = 0.9, then the similarity measures exceeding V are:

Cluster 1: $\sigma(1,5)$, $\sigma(1,6)$, $\sigma(5,6)$

Cluster 2: $\sigma(2,3)$, $\sigma(2,4)$, $\sigma(3,4)$

Cluster 3: $\sigma(7,8)$, $\sigma(7,9)$, $\sigma(8,9)$

Therefore, the HMM method has successfully classified all sample paths.
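
The thresholding step can be reproduced from the values in Table C.1 with the following Python sketch. Grouping the pairs that exceed the threshold into connected components is an assumption about how pairwise similarities are turned into clusters; the text only states that sample paths with similarity above V are clustered together. The function name `threshold_clusters` is hypothetical.

```python
import numpy as np

# Similarity values copied from Table C.1 (rows and columns are HMM 1..9).
SIGMA = np.array([
    [1.000, 0.760, 0.769, 0.776, 0.950, 0.950, 0.798, 0.804, 0.794],
    [0.760, 1.000, 0.940, 0.949, 0.772, 0.776, 0.837, 0.839, 0.835],
    [0.769, 0.940, 1.000, 0.947, 0.777, 0.785, 0.850, 0.847, 0.847],
    [0.776, 0.949, 0.947, 1.000, 0.787, 0.793, 0.847, 0.844, 0.844],
    [0.950, 0.772, 0.777, 0.787, 1.000, 0.951, 0.799, 0.804, 0.799],
    [0.950, 0.776, 0.785, 0.793, 0.951, 1.000, 0.815, 0.820, 0.809],
    [0.798, 0.837, 0.850, 0.847, 0.799, 0.815, 1.000, 0.945, 0.943],
    [0.804, 0.839, 0.847, 0.844, 0.804, 0.820, 0.945, 1.000, 0.950],
    [0.794, 0.835, 0.847, 0.844, 0.799, 0.809, 0.943, 0.950, 1.000],
])

def threshold_clusters(sim, threshold):
    """Group indices (1-based) linked by similarities above the threshold."""
    n = len(sim)
    unvisited = set(range(n))
    clusters = []
    for start in range(n):
        if start not in unvisited:
            continue
        unvisited.discard(start)
        stack, members = [start], []
        while stack:                             # depth-first search over linked HMMs
            i = stack.pop()
            members.append(i + 1)
            for j in sorted(unvisited):
                if sim[i][j] > threshold:
                    unvisited.discard(j)
                    stack.append(j)
        clusters.append(sorted(members))
    return clusters

print(threshold_clusters(SIGMA, 0.9))  # [[1, 5, 6], [2, 3, 4], [7, 8, 9]]
```

The three groups match the assignment of sample paths to the systems $S_1$, $S_2$ and $S_3$ stated above.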


C.7  Conclusion 


In this paper, we discuss the basic concepts involved in multi-resolution simulation modeling
of complex stochastic systems and demonstrate that using the average over all available
high-resolution simulation results as the input to subsequent low-resolution modules is inappropriate
and can lead to erroneous final results. Instead, high-resolution output data should
be clustered into groups that match underlying features of the system behavior before feeding
group averages to low-resolution modules. In addition, we propose two approaches for
performing high-dimensional data clustering based on neural networks and Hidden Markov
Models.
Appendix D


A  Generalized  ‘Surrogate  Problem’ 
Methodology  for  On-Line 
Stochastic  Discrete  Optimization 


Kagan  Gokbayrak  and  Christos  G.  Cassandras1 
Department  of  Manufacturing  Engineering,  Boston  University,  Boston,  MA  02215 


Abstract 

We consider stochastic discrete optimization problems where the decision variables are non-negative
integers and propose a generalized “surrogate problem” methodology that modifies
and extends previous work in [33]. Our approach is based on an on-line control scheme which
transforms  the  problem  into  a  “surrogate”  continuous  optimization  problem  and  proceeds 
to  solve  the  latter  using  standard  gradient-based  approaches  while  simultaneously  updating 
both  actual  and  surrogate  system  states.  In  contrast  to  [33],  the  proposed  methodology 
applies  to  arbitrary  constraint  sets.  It  is  shown  that,  under  certain  conditions,  the  solution 
of  the  original  problem  is  recovered  from  the  optimal  surrogate  state.  Applications  of 
this  approach  include  solutions  to  multicommodity  resource  allocation  problems,  where, 
exploiting  the  convergence  speed  of  the  method,  one  can  overcome  the  obstacle  posed  by 
the  presence  of  local  optima. 
contract F30602-99-C-0057 and by EPRI/ARO under contract WO8333-03.
We consider stochastic discrete optimization problems where the decision variables are
non-negative integers. Problems of this type abound, for instance, in manufacturing systems
and communication networks. In a manufacturing system setting, examples include
the classic buffer allocation problem, where K buffer slots are to be distributed over N
manufacturing workstations so as to optimize performance criteria involving throughput
or mean system time; a variant of this problem involving the use of kanban (rather than
buffer slots) to be allocated to different workstations [21]; and determining the optimal lot
size for each of N different part types sharing resources in a production facility with setup
delays incurred when a switch from a lot of one part type to another occurs [40]. In a
communication network setting, similar buffer allocation issues arise, as well as transmission
scheduling problems where a fixed number of time slots forming a “frame” must be
allocated over several call types [6]. Such optimization problems are also very common in
any discrete resource allocation setting [44], as well as in control policies for Discrete Event
Systems (DES) that are parameterized by discrete variables such as thresholds or hedging
points.

The optimization problem we are interested in is of the general form

$$\min_{r \in A_d} J_d(r) = E[L_d(r, \omega)] \qquad \text{(D.1)}$$

where $r \in \mathbb{Z}_+^N$ is a decision vector or “state” and $A_d$ represents a constraint set. In a
stochastic setting, let $L_d(r, \omega)$ be the cost incurred over a specific sample path $\omega$ when
the state is r and $J_d(r) = E[L_d(r, \omega)]$ be the expected cost of the system operating under
r. The sample space is $\Omega = [0, 1]^\infty$, that is, $\omega \in \Omega$ is a sequence of random numbers
from [0, 1] used to generate a sample path of the system. The cost functions are defined
as $L_d : A_d \times \Omega \to \mathbb{R}$ and $J_d : A_d \to \mathbb{R}$, and the expectation is defined with respect to a
probability space $(\Omega, \Im, P)$ where $\Im$ is an appropriately defined $\sigma$-field on $\Omega$ and P is a
conveniently chosen probability measure. In the sequel, ‘$\omega$’ is dropped from $L_d(r, \omega)$ and,
unless otherwise noted, all costs will be over the same sample path.

The problem (D.1) is a notoriously hard stochastic integer programming problem. Even
in a deterministic setting, where $J_d(r) = L_d(r)$, this class of problems is NP-hard (see [44],
[68] and references therein). In some cases, depending upon the form of the objective function
$J_d(r)$ (e.g., separability, convexity), efficient algorithms based on finite-stage dynamic
programming or generalized Lagrange relaxation methods are known (see [44] for a comprehensive
discussion on aspects of deterministic resource allocation algorithms). Alternatively,
if no a priori information is known about the structure of the problem, then some form of
a search algorithm is employed (e.g., Simulated Annealing [1], Genetic Algorithms [43]).

When the system operates in a stochastic environment (e.g., in a resource allocation
setting, users request resources at random time instants or hold a particular resource for a
random period of time) and no closed-form expression for $E[L_d(r)]$ is available, the problem
is further complicated by the need to estimate $E[L_d(r)]$. This generally requires Monte Carlo
simulation or direct measurements made on the actual system. Most known approaches are
based on some form of random search, as in algorithms proposed by Yan and Mukai [87],
Gong et al. [36], and Shi and Olafsson [73]. Another recent contribution to this area involves
the ordinal optimization approach presented in [42] and used in [18] to solve a class of
resource allocation problems. Even though the approach in [18] yields a fast algorithm, it is
still constrained to iterate so that every step involves the transfer of no more than a single
resource from one user to some other user. One can expect, however, that much faster
improvements can be realized in a scheme allowed to reallocate multiple resources from
users whose cost-sensitivities are small to users whose sensitivities are much larger. This is
precisely the rationale of most gradient-based continuous optimization schemes, where the
gradient is a measure of this sensitivity.

With this motivation in mind, a new approach was proposed in [33] based on the following
idea: The discrete optimization problem (D.1) is transformed into a “surrogate”
continuous optimization problem which is solved using standard gradient-based methods;
its solution is then transformed back into a solution of the original problem. Moreover, this
process is designed explicitly for on-line operation. That is, at every iteration step in the
solution of the surrogate continuous optimization problem, the surrogate continuous state
is immediately transformed into a feasible discrete state r. This is crucial, since whatever
information is used to drive the process (e.g., sensitivity estimates) can only be obtained
from a sample path of the actual system operating under r. It was shown in [33] that for resource
allocation problems, where the constraint set is of the form $A_d = \{r : \sum_{i=1}^{N} r_i = K\}$,
the solution of (D.1) can be recovered from the solution of the surrogate continuous optimization
problem; the latter is obtained using a stochastic approximation algorithm which
converges under standard technical conditions.

The contributions of this paper are the following. First, we generalize the methodology
presented in [33] to problems of the form (D.1) which are not necessarily limited
to constraints such as $A_d = \{r : \sum_{i=1}^{N} r_i = K\}$, including the possibility of unconstrained
problems. Second, we modify the approach developed in [33] in order to improve its computational
efficiency. In particular, computational efficiency is gained in the following respects:

1. A crucial aspect of the “surrogate problem” method is the fact that the surrogate
state, denoted by $\rho \in \mathbb{R}_+^N$, can be expressed as a convex combination of at most N + 1
points in $A_d$, where N is the dimensionality of $r \in A_d$. Determining such points
is not a simple task. In [33], this was handled using the Simplex Method of Linear
Programming, which can become inefficient for large values of N. In this paper, we
show that for any surrogate state $\rho$, a selection set $S(\rho)$ of such N + 1 points, not
necessarily in $A_d$, can be identified through a simple algorithm of linear complexity.
Moreover, this algorithm applies to any problem of the form (D.1), not limited to any
special type of constraint set $A_d$.

2. In solving the surrogate continuous optimization problem, a surrogate objective function
is defined whose gradient is estimated in order to drive a stochastic approximation
type of algorithm. The gradient estimate computation in [33] involves the inversion
of an N x N matrix. In this paper, we show that this is not needed if one makes use
of the selection set $S(\rho)$ mentioned above, and the gradient estimate computation is
greatly simplified.

The price to pay for the generalization of the approach is the difficulty in establishing
a general result regarding the recovery of the solution of (D.1) from the solution of the
surrogate problem as was done in our earlier work [33]. We are able, however, to still do so
for two interesting cases. Despite this difficulty, the empirical evidence to date indicates that
this generalized methodology provides the optimal solutions under appropriate technical
conditions guaranteeing convergence of a stochastic approximation scheme.

A third contribution of this paper is in tackling a class of particularly hard multicommodity
discrete optimization problems, where multiple local optima typically exist. Exploiting
the convergence speed of the surrogate method, we present, as an application of the proposed
approach, a systematic means for solving such combinatorially hard problems.

The rest of the paper is organized as follows. In Section D.1, we give an overview of our
basic approach. In Section D.2, we present the key results enabling us to transform a discrete
stochastic optimization problem into a “surrogate” continuous optimization problem. In
Section D.3, we discuss the construction of appropriate “surrogate” cost functions for our
approach and the evaluation of their gradients. Section D.4 discusses how to recover the
solution of the original problem from that of the “surrogate” problem. In Section D.5, we
present the detailed optimization algorithm and discuss its convergence properties. Some
numerical examples and applications are presented in Section D.6.


D.1  Basic approach for on-line control


In the sequel, we shall adopt the following notational conventions as in [33]. We shall use
subscripts to indicate components of a vector (e.g., $r_i$ is the ith component of r). We shall
use superscripts to index vectors belonging to a particular set (e.g., $r^j$ is the jth vector
within a subset of $A_d$ that contains such vectors). Finally, we reserve the index n as a
subscript that denotes iteration steps and not vector components (e.g., $r_n$ is the value of r
at the nth step of an iterative scheme, not the nth component of r).

The expected cost function $J_d(r)$ is generally nonlinear in r, a vector of integer-valued
decision variables; therefore, (D.1) is a nonlinear integer programming problem. One common
method for solving this problem is to relax the integer constraints on all $r_i$ so that
they can be regarded as continuous (real-valued) variables and then to apply standard optimization
techniques such as gradient-based algorithms. Let the “relaxed” set $A_c$ contain
the original constraint set $A_d$ and define $L_c : \mathbb{R}_+^N \times \Omega \to \mathbb{R}$ to be the cost function over a
specific sample path. As before, let us drop ‘$\omega$’ from $L_c(\rho, \omega)$ and agree that, unless otherwise
noted, all costs will be over the same sample path. The resulting “surrogate” problem then
becomes: Find $\rho^*$ that minimizes the “surrogate” expected cost function $J_c : \mathbb{R}_+^N \to \mathbb{R}$ over
the continuous set $A_c$, i.e.,

$$J_c(\rho^*) = \min_{\rho \in A_c} J_c(\rho) = E[L_c(\rho)] \qquad \text{(D.2)}$$

where $\rho \in \mathbb{R}_+^N$ is a real-valued state, and the expectation is defined on the same probability
space $(\Omega, \Im, P)$ as described earlier. Assuming an optimal solution $\rho^*$ can be determined,
this state must then be mapped back into a discrete vector by some means (usually, some
form of truncation). Even if the final outcome of this process can recover the actual $r^*$ in
(D.1), this approach is strictly limited to off-line analysis: When an iterative scheme is used
to solve the problem in (D.2) (as is usually the case except for very simple problems of limited
interest), a sequence of points $\{\rho_n\}$ is generated; these points are generally continuous states
in $A_c$, hence they may be infeasible in the original discrete optimization problem. Moreover,
if one has to estimate $E[L_c(\rho)]$ or $\partial E[L_c(\rho)]/\partial\rho$ through simulation, then a simulation model of
the surrogate problem must be created, which is also not generally feasible. If, on the other
hand, the only cost information available is through direct observation of sample paths of
an actual system, then there is no obvious way to estimate $E[L_c(\rho)]$ or $\partial E[L_c(\rho)]/\partial\rho$, since this
applies to the real-valued state $\rho$, not to the integer-valued actual state r.

As in [33], we adopt here a different approach intended to operate on line. In particular,
we still invoke a relaxation such as the one above, i.e., we formulate a surrogate continuous
optimization problem with some state space $A_c \subset \mathbb{R}_+^N$ and $A_d \subset A_c$. However, at every
step n of the iteration scheme involved in solving the problem, both the continuous and
the discrete states are simultaneously updated through a mapping of the form $r_n = f_n(\rho_n)$.
This has two advantages: First, the cost of the original system is continuously adjusted
(in contrast to an adjustment that would only be possible at the end of the surrogate
minimization process); and second, it allows us to make use of information typically needed
to obtain cost sensitivities from the actual operating system at every step of the process.

The basic scheme we consider is the same as in [33] and is outlined below for the sake
of self-sufficiency of the paper. Initially, we set the “surrogate system” state to be that of
the actual system state, i.e.,

$$\rho_0 = r_0 \qquad \text{(D.3)}$$

Subsequently, at the nth step of the process, let $H_n(\rho_n, r_n, \omega_n)$ denote an estimate of the
sensitivity of the cost $J_c(\rho_n)$ with respect to $\rho_n$ obtained over a sample path $\omega_n$ of the
actual system operating under allocation $r_n$; details regarding this sensitivity estimate will
be provided later in the paper. Two sequential operations are then performed at the nth
step:

1. The continuous state $\rho_n$ is updated through

$$\rho_{n+1} = \pi_{n+1}[\rho_n - \eta_n H_n(\rho_n, r_n, \omega_n)] \qquad \text{(D.4)}$$

where $\pi_{n+1} : \mathbb{R}^N \to A_c$ is a projection function so that $\rho_{n+1} \in A_c$ and $\eta_n$ is a “step
size” parameter.

2. The newly determined state of the surrogate system, $\rho_{n+1}$, is transformed into an
actual feasible discrete state of the original system through

$$r_{n+1} = f_{n+1}(\rho_{n+1}) \qquad \text{(D.5)}$$

where $f_{n+1} : A_c \to A_d$ is a mapping of feasible continuous states to feasible discrete
states which must be appropriately selected as will be discussed later.
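
The structure of the iteration (D.3)-(D.5) can be sketched in Python as follows. This is an illustrative skeleton under stated assumptions, not the algorithm as implemented by the authors: the quadratic cost, the finite-difference sensitivity estimate evaluated at the discrete state, the projection onto the non-negative orthant, rounding as the continuous-to-discrete mapping, and the step-size rule eta0/(n+1) are all hypothetical choices for an unconstrained toy problem.

```python
import numpy as np

def surrogate_iteration(r0, grad_estimate, project, to_discrete, n_steps, eta0=0.4):
    """Run the two-step update: continuous state first, then the discrete state."""
    r = np.asarray(r0, dtype=float)
    rho = r.copy()                               # (D.3): surrogate state starts at r_0
    for n in range(n_steps):
        H = grad_estimate(rho, r)                # estimate observed under the discrete state r_n
        eta = eta0 / (n + 1)                     # assumed decreasing step size
        rho = project(rho - eta * H)             # (D.4): projected gradient step
        r = to_discrete(rho)                     # (D.5): map back to a feasible discrete state
    return rho, r

# Hypothetical toy problem: minimize a quadratic cost over non-negative integers.
TARGET = np.array([3.0, 7.0])

def cost(r):
    return float(np.sum((r - TARGET) ** 2))

def grad_estimate(rho, r):
    """Central differences on neighboring discrete states (illustrative only)."""
    return np.array([(cost(r + e) - cost(r - e)) / 2.0 for e in np.eye(len(r))])

def project(x):
    return np.maximum(x, 0.0)                    # keep the surrogate state non-negative

def to_discrete(rho):
    return np.rint(rho)                          # nearest integer point

rho_final, r_final = surrogate_iteration([10, 0], grad_estimate, project, to_discrete, 50)
```

Note that the sensitivity estimate is always computed from the discrete state, which is the only state under which the actual system can be observed.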
One can recognize in (D.4) the form of a stochastic approximation algorithm (e.g., [50])
that generates a sequence $\{\rho_n\}$ aimed at solving (D.2). However, there is an additional
operation (D.5) for generating a sequence $\{r_n\}$ which we would like to see converge to $r^*$
in (D.1). It is important to note that $\{r_n\}$ corresponds to feasible realizable states based
on which one can evaluate estimates $H_n(\rho_n, r_n, \omega_n)$ from observable data, i.e., a sample
path of the actual system under $r_n$ (not the surrogate state $\rho_n$). We can therefore see that
this scheme is intended to combine the advantages of a stochastic approximation type of
algorithm with the ability to obtain sensitivity estimates with respect to discrete decision
variables. In particular, sensitivity estimation methods for discrete parameters based on
Perturbation Analysis (PA) and Concurrent Estimation [41], [20] are ideally suited to meet
this objective.

Before addressing the issue of obtaining estimates $H_n(\rho_n, r_n, \omega_n)$ necessary for the optimization
scheme described above to work, there are two other crucial issues that form the
cornerstones of the proposed approach. First, the selection of the mapping $f_{n+1}$ in (D.5)
must be specified. Second, a surrogate cost function $L_c(\rho, \omega)$ must be identified and its
relationship to the actual cost $L_d(r, \omega)$ must be made explicit. These issues are discussed
next, in Sections D.2 and D.3 respectively, for the problem (D.1), which, as previously mentioned,
is not limited to the class of resource allocation problems considered in our earlier
work [33].


D.2  Continuous-to-discrete  state  transformations 

Let us first define $C(\rho)$, the set of vertices of the unit “cube” around the surrogate state, as

$$C(\rho) = \{ r \mid \forall i,\ r_i \in \{\lfloor \rho_i \rfloor, \lceil \rho_i \rceil\} \}$$

where, for any $x \in \mathbb{R}$, $\lceil x \rceil$ and $\lfloor x \rfloor$ denote the ceiling (smallest integer $\ge x$) and floor
(largest integer $\le x$) of x respectively. Note that when $\rho_i \in \mathbb{Z}$, all the ith components of
the cube elements are the same ($= \rho_i$), decreasing the dimension of the cube by one. In
order to avoid the technical complications due to integer components in $\rho$, let us agree that
whenever this is the case we will perturb the integer components to obtain a new state $\hat{\rho}$
whose components are non-integer, and then relabel this state as $\rho$.

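Next, we define $\mathcal{N}(\rho)$, the set of all feasible neighboring discrete states in $C(\rho)$, as:

$$\mathcal{N}(\rho) = C(\rho) \cap A_d \qquad \text{(D.6)}$$

A more explicit and convenient characterization of the set $\mathcal{N}(\rho)$ is

$$\mathcal{N}(\rho) = \{ r \mid r = \lfloor \rho \rfloor + \tilde{r} \text{ for all } \tilde{r} \in \{0, 1\}^N \} \cap A_d$$

where $\lfloor \rho \rfloor$ is the vector whose components are $\lfloor \rho \rfloor_i = \lfloor \rho_i \rfloor$. In other words, $\mathcal{N}(\rho)$ is the set
of vertices of the unit “cube” containing $\rho$ that are in the feasible discrete set $A_d$.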
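
To make the two sets concrete, the following Python sketch enumerates the cube vertices around a surrogate state and keeps those that are feasible. The function names and the representation of the constraint set as a predicate `is_feasible` are assumptions for illustration; the enumeration is exponential in N and is meant only for small examples.

```python
import itertools
import math

def cube_vertices(rho):
    """All integer vectors whose ith component is the floor or ceiling of rho[i]."""
    # A set removes the duplicate when a component is already an integer.
    choices = [sorted({math.floor(x), math.ceil(x)}) for x in rho]
    return [tuple(v) for v in itertools.product(*choices)]

def feasible_neighbors(rho, is_feasible):
    """Cube vertices that also belong to the discrete constraint set."""
    return [r for r in cube_vertices(rho) if is_feasible(r)]

# Example: two resources that must add up to 4.
print(cube_vertices([1.3, 2.6]))                                  # [(1, 2), (1, 3), (2, 2), (2, 3)]
print(feasible_neighbors([1.3, 2.6], lambda r: sum(r) == 4))      # [(1, 3), (2, 2)]
```

With an integer component, as in `cube_vertices([1.0, 2.5])`, only two vertices remain, which is the loss of one cube dimension noted above.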

In earlier work (see [33]), we limited ourselves to resource allocation problems with linear
capacity constraints. For this class of problems, we used $A_c = \mathrm{conv}(A_d)$ as the feasible set
in the “surrogate” continuous state space. When the feasible set $A_d$ is not a polyhedron,
the set $A_c = \mathrm{conv}(A_d)$ may include discrete states that are not in $A_d$. In order to prevent
this and to generalize the approach, we modify the definition of $A_c$, given the set $\mathcal{N}(\rho)$, as
follows:

$$A_c = \bigcup_{\rho \in \mathbb{R}_+^N} \mathrm{conv}(\mathcal{N}(\rho)) \qquad \text{(D.7)}$$

Note that $A_c \subseteq \mathrm{conv}(A_d)$ is the union of the convex hulls of feasible discrete points in every
cube, and is not necessarily convex. Note also that the definition reduces to $A_c = \mathrm{conv}(A_d)$
when $A_d$ is formed by all the discrete points in a polyhedron.

Now we are ready to define the set of transformation functions $\mathcal{F}_\rho$ as follows:

$$\mathcal{F}_\rho = \{\, f \mid f : A_c \to A_d,\ (f(\rho))_i \in \{\lceil\rho_i\rceil, \lfloor\rho_i\rfloor\},\ i = 1,\ldots,N \,\}$$

The purpose of $f \in \mathcal{F}_\rho$ is to transform some continuous state vector $\rho \in A_c$ into a
"neighboring" discrete state vector $r \in \mathcal{N}(\rho)$ obtained by seeking $\lceil\rho_i\rceil$ or $\lfloor\rho_i\rfloor$ for each component
$i = 1,\ldots,N$. The existence of such a transformation is guaranteed by the projection
mapping $\pi$ in (D.4), which ensures that $\rho \in A_c$, therefore $\mathcal{N}(\rho)$ is non-empty. A convenient
element of $\mathcal{F}_\rho$ that we shall use throughout the paper is

$$f(\rho) = \arg\min_{r \in \mathcal{N}(\rho)} \|\rho - r\|$$

which maps the surrogate state $\rho$ to the closest feasible neighbor in $\mathcal{N}(\rho)$. However, our
analysis is certainly not limited to this choice.
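As an illustration of the nearest-neighbor choice of $f$, the sketch below evaluates $\arg\min_{r \in \mathcal{N}(\rho)} \|\rho - r\|$ by brute force. It is only a sketch under stated assumptions: the function name and the `is_feasible` predicate are hypothetical, the input values come from the report's later three-user example, and $\rho$ is assumed to have non-integer components.

```python
import math
from itertools import product


def closest_feasible_neighbor(rho, is_feasible):
    """Brute-force f(rho): the feasible cube vertex nearest to rho in Euclidean norm."""
    base = [math.floor(x) for x in rho]
    candidates = (tuple(b + d for b, d in zip(base, delta))
                  for delta in product((0, 1), repeat=len(rho)))
    feasible = [r for r in candidates if is_feasible(r)]
    if not feasible:
        # N(rho) is empty, so rho does not belong to A_c.
        raise ValueError("rho has no feasible neighbor")
    return min(feasible, key=lambda r: math.dist(rho, r))


# Three users sharing K = 10 resources (values from the report's example).
r = closest_feasible_neighbor([3.9, 3.9, 2.2], lambda r: sum(r) == 10)
```

For a total-capacity constraint the same point can be found without enumeration by rounding up the components with the largest fractional parts, which is the content of Lemma 12 below.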

A key element of our approach is based on the fact that $\rho$ can be expressed as a convex
combination of at most $N+1$ points in $C(\rho)$, as shown in Theorem 11 below. Given that the
cardinality of $C(\rho)$ is combinatorially explosive, i.e., $2^N$, determining the set of these points
is not a simple task. In [33], it was shown that such a set of feasible points, $\mathcal{N}_N(\rho)$, a subset
of $\mathcal{N}(\rho)$, can be determined using the Simplex Method when problems of the form (D.1) are
limited to constraint sets $A_d = \{\, r : \sum_{i=1}^{N} r_i = K \,\}$. In what follows, we provide a different
approach based on defining a selection set $\mathcal{S}(\rho)$ which (a) allows us to specify the $N+1$
points in $C(\rho)$ that define a set whose convex hull includes $\rho$ for problems with arbitrary $A_d$,
(b) is much simpler than the Simplex Method, and (c) simplifies the gradient estimation
procedure as we will see in Section D.3. An important distinction between $\mathcal{N}_N(\rho)$ and $\mathcal{S}(\rho)$
is that the latter is not limited to include only feasible points $r \in \mathcal{N}(\rho)$.

Definition 1 The set $\mathcal{S}(\rho) \subseteq C(\rho)$ is a selection set if it satisfies the following conditions:

• $|\mathcal{S}(\rho)| = N+1$

• The surrogate state $\rho$ resides in the convex hull of $\mathcal{S}(\rho)$, i.e., there exist $\{\alpha_i\}$ such
that

$$\rho = \sum_{i=0}^{N} \alpha_i r^i, \quad \text{with}\ \sum_{i=0}^{N} \alpha_i = 1,\ \alpha_i \ge 0,\ r^i \in \mathcal{S}(\rho)$$

• The vectors in the set $\{\, \bar r^i \mid \bar r^i = [\,1 \;\; r^i\,],\ r^i \in \mathcal{S}(\rho) \,\}$ are linearly independent.

Next we will show the existence of the selection set $\mathcal{S}(\rho)$ by a constructive proof.

Theorem 11 A selection set $\mathcal{S}(\rho)$ exists for any $\rho \in \mathbb{R}^N_+$.

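Proof. We construct a selection set $\mathcal{S}(\rho)$ for $\rho = [\rho_1, \ldots, \rho_N]$ as described below and prove
that it satisfies all three conditions in Definition 1.

Let us define $e_i$ as the $N$-dimensional unit vector whose $i$th component is 1; the residual
vector $\tilde\rho = \rho - \lfloor\rho\rfloor$; and the $N$-dimensional ordering vector $o$ such that $o_k \in \{1,\ldots,N\}$,
$k = 1,\ldots,N$, and $o_k$ satisfies

$$\tilde\rho_{o_k} \le \tilde\rho_{o_{k+1}} \quad \text{for}\ k = 1, \ldots, N-1$$

Note that the definition of $\tilde\rho$ implies that $0 < \tilde\rho_j < 1$ for all $j = 1,\ldots,N$. Next, we define

$$\tilde r^{o_l} = \sum_{k=l}^{N} e_{o_k} \tag{D.8}$$

and

$$\alpha_{o_l} = \begin{cases} \tilde\rho_{o_l} - \tilde\rho_{o_{l-1}}, & l > 1 \\ \tilde\rho_{o_1}, & l = 1 \end{cases} \;\ge 0 \tag{D.9}$$

It follows from (D.9) that we can write

$$\tilde\rho_{o_l} = \sum_{k=1}^{l} \alpha_{o_k} \tag{D.10}$$

and note that

$$\tilde\rho_{o_N} = \sum_{k=1}^{N} \alpha_{o_k} = \sum_{j=1}^{N} \alpha_j$$

Using (D.9) we have defined $\alpha_i$ for $i = 1,\ldots,N$. In addition, we now define

$$\alpha_0 = 1 - \sum_{i=1}^{N} \alpha_i = 1 - \tilde\rho_{o_N} > 0 \tag{D.11}$$

Similarly, (D.8) defines $\tilde r^i$ for $i = 1,\ldots,N$. In addition, we define

$$\tilde r^0 = \mathbf{0} \tag{D.12}$$

where $\mathbf{0} = [0 \ldots 0]$ is the $N$-dimensional zero vector. Note that we can write

$$\tilde\rho = \sum_{l=1}^{N} \tilde\rho_{o_l}\, e_{o_l} \tag{D.13}$$

Combining (D.10) and (D.13) and changing the summation indices gives

$$\tilde\rho = \sum_{l=1}^{N} \sum_{k=1}^{l} \alpha_{o_k}\, e_{o_l} = \sum_{k=1}^{N} \alpha_{o_k} \sum_{l=k}^{N} e_{o_l} \tag{D.14}$$

Then, using (D.8) we get

$$\tilde\rho = \sum_{k=1}^{N} \alpha_{o_k}\, \tilde r^{o_k} = \sum_{i=1}^{N} \alpha_i\, \tilde r^i \tag{D.15}$$

Next, we define

$$r^j = \lfloor\rho\rfloor + \tilde r^j \tag{D.16}$$

Then, we can write

$$\sum_{j=0}^{N} \alpha_j r^j = \sum_{j=0}^{N} \alpha_j \left( \lfloor\rho\rfloor + \tilde r^j \right) = \lfloor\rho\rfloor \sum_{j=0}^{N} \alpha_j + \sum_{j=0}^{N} \alpha_j \tilde r^j$$

Observing that $\sum_{j=0}^{N} \alpha_j = 1$ from (D.11), and that

$$\sum_{j=0}^{N} \alpha_j \tilde r^j = \sum_{j=1}^{N} \alpha_j \tilde r^j + \alpha_0 \tilde r^0 = \tilde\rho$$

from (D.12) and (D.15), it follows that

$$\sum_{j=0}^{N} \alpha_j r^j = \lfloor\rho\rfloor + \tilde\rho = \rho \tag{D.17}$$

i.e., the convex hull formed by $\mathcal{S}(\rho) = \{r^0, \ldots, r^N\}$, with $r^i$ defined in (D.16), contains
$\rho$. This satisfies the second condition in Definition 1. Moreover, from (D.8), (D.12), and
(D.16), it is obvious that $|\mathcal{S}(\rho)| = N+1$, satisfying the first condition as well.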

It remains to show that the vectors $[\,1 \;\; r^i\,]$ with $r^i$ defined in (D.16) are linearly
independent. Consider the matrix $[\,e \;\; R\,]$ where $e = [1 \cdots 1]'$ is the $(N+1)$-dimensional
vector of 1's and $R$ is the matrix whose rows are vectors from $\mathcal{S}(\rho)$ such that

$$R = \begin{bmatrix} r^0 \\ r^{o_N} \\ \vdots \\ r^{o_1} \end{bmatrix}$$

Using (D.8), (D.12), and (D.16), one can write $[\,1 \;\; r^{o_l}\,] - [\,1 \;\; r^{o_{l+1}}\,] = [\,0 \;\; e_{o_l}\,]$ for
$l < N$ and $[\,1 \;\; r^{o_N}\,] - [\,1 \;\; r^0\,] = [\,0 \;\; e_{o_N}\,]$. Finally, note that

$$[\,1 \;\; r^0\,] = [\,1 \;\; \mathbf{0}\,] + \sum_{j=1}^{N} \lfloor\rho_j\rfloor\, [\,0 \;\; e_j\,]$$

Using these arguments one can show that the matrix $[\,e \;\; R\,]$ can be transformed into the
identity matrix of dimension $N+1$ by row operations. Therefore, $[\,e \;\; R\,]$ is non-singular,
i.e., the third condition of Definition 1 is satisfied. Moreover, the inverse of $[\,e \;\; R\,]$, which
will be needed during the gradient estimation part of our approach in the next section, can
be evaluated to give:

$$[\,e \;\; R\,]^{-1} = \begin{bmatrix} 1 + \lfloor\rho_{o_N}\rfloor & \lfloor\rho_{o_{N-1}}\rfloor - \lfloor\rho_{o_N}\rfloor & \cdots & \lfloor\rho_{o_1}\rfloor - \lfloor\rho_{o_2}\rfloor & -\lfloor\rho_{o_1}\rfloor \\ \multicolumn{5}{c}{-e'_{N+1-\bar o_1} + e'_{N+2-\bar o_1}} \\ \multicolumn{5}{c}{\vdots} \\ \multicolumn{5}{c}{-e'_{N+1-\bar o_N} + e'_{N+2-\bar o_N}} \end{bmatrix} \tag{D.18}$$

where $e_i$ is the $(N+1)$-dimensional unit vector whose $i$th component is 1 and $\bar o$ is the
$N$-dimensional ordering vector such that $\bar o_k \in \{1, \ldots, N\}$ satisfying the relation

$$o_i = j \iff \bar o_j = i$$

One can also verify that $[\,e \;\; R\,]^{-1}$ above is such that $[\,e \;\; R\,]^{-1} [\,e \;\; R\,] = I$. ■


We stress that the selection set $\mathcal{S}(\rho)$ is not unique; however, given a selection set $\mathcal{S}(\rho)$,
the $\alpha_i$ values are unique for $i = 0, \ldots, N$. There are clearly different ways one can construct
$\mathcal{S}(\rho)$, including randomized methods. For instance, one can start out by randomly selecting
the first element of the selection set from $C(\rho)$ and then proceed through a scheme similar
to the one used above.

The following is an algorithmic procedure for constructing $\mathcal{S}(\rho)$ as presented in Theorem 11:

• Initialize the index set $I = \{1, \ldots, N\}$ and define a temporary vector $v = \tilde\rho$.

• While $I \ne \emptyset$:

1. $\tilde r^i = \sum_{j \in I} e_j$ where $i = \arg\min\{v_j,\ j \in I\}$

2. $\alpha_i = v_i$

3. $v \leftarrow v - \alpha_i \tilde r^i$

4. $I \leftarrow I \setminus \{i\}$

• $\tilde r^0 = \mathbf{0}$

• $\alpha_0 = 1 - \sum_{i=1}^{N} \alpha_i$

• $\mathcal{S}(\rho) = \{\, r^i \mid r^i = \tilde r^i + \lfloor\rho\rfloor \ \text{for}\ i = 0, \ldots, N \,\}$
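The procedure above translates almost line by line into code. The following is a minimal sketch rather than the authors' implementation: the function name, the 0-based index bookkeeping, and the tie-breaking rule (lowest index first) are assumptions made for illustration, and $\rho$ is assumed to have non-integer components.

```python
import math


def selection_set(rho):
    """Construct S(rho) and the weights alpha following the bulleted procedure.

    Returns (points, alphas) where points[0] = r^0 = floor(rho) and, for
    i = 1..N, points[i] = r^i and alphas[i] = alpha_i belong to component i.
    """
    n = len(rho)
    base = [math.floor(x) for x in rho]
    v = [x - b for x, b in zip(rho, base)]        # temporary vector v = residual of rho
    remaining = set(range(n))                     # index set I (0-based here)
    points = [None] * (n + 1)
    alphas = [0.0] * (n + 1)
    while remaining:
        # arg min of v over I; ties are broken toward the lowest index (assumption)
        i = min(sorted(remaining), key=lambda j: v[j])
        r_tilde = [1 if j in remaining else 0 for j in range(n)]   # sum of e_j, j in I
        alphas[i + 1] = v[i]
        v = [vj - alphas[i + 1] * rj for vj, rj in zip(v, r_tilde)]
        points[i + 1] = tuple(b + rj for b, rj in zip(base, r_tilde))
        remaining.remove(i)
    points[0] = tuple(base)                       # r^0 = floor(rho)
    alphas[0] = 1.0 - sum(alphas[1:])
    return points, alphas


rho = [3.9, 3.9, 2.2]
points, alphas = selection_set(rho)
# Convex combination of the selected vertices; should reproduce rho.
recombined = [sum(a * p[j] for a, p in zip(alphas, points)) for j in range(len(rho))]
```

Each pass costs $O(N)$, so the whole construction is $O(N^2)$ as written (or $O(N \log N)$ if the residuals are sorted once), in contrast to the $2^N$ vertices of $C(\rho)$.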


Example: In order to clarify our notation and illustrate the specification of the sets
$C(\rho)$, $\mathcal{N}(\rho)$ and $\mathcal{S}(\rho)$, we provide the following example, which we will use throughout our
analysis. Consider the allocation problem of $K = 10$ resources over $N = 3$ users, and let
$\rho = [3.9, 3.9, 2.2]$. The feasible set is

$$A_d = \Big\{\, r : \sum_{i=1}^{3} r_i = 10 \,\Big\} \tag{D.19}$$

Since $\lfloor\rho\rfloor = [3,3,2]$, we have the unit cube

$$C(\rho) = \{[3,3,2], [3,3,3], [3,4,2], [3,4,3], [4,3,2], [4,3,3], [4,4,2], [4,4,3]\}$$

and the feasible neighbors of $\rho$ are

$$\mathcal{N}(\rho) = \{[3,4,3], [4,4,2], [4,3,3]\}$$

Let us now construct a selection set satisfying the conditions of Definition 1 using the
algorithm described above. First we initialize the index set $I = \{1,2,3\}$ and the residual
vector $v = \rho - \lfloor\rho\rfloor = [0.9, 0.9, 0.2]$. Note that $\arg\min_{j \in \{1,2,3\}} \{v_j\} = 3$, therefore,

$$\tilde r^3 = \sum_{j \in \{1,2,3\}} e_j = [1,1,1]$$

$$\alpha_3 = v_3 = 0.2$$

Next, we update

$$v \leftarrow v - \alpha_3 \tilde r^3 = [0.7, 0.7, 0]$$

$$I \leftarrow I \setminus \{3\} = \{1,2\}$$

Now, note that the first two components in the updated $v$ are equal. We can select any one
of them for the next $\arg\min_{j \in \{1,2\}} \{v_j\}$. Thus, if we pick the first component we get

$$\tilde r^1 = \sum_{j \in \{1,2\}} e_j = [1,1,0]$$

$$\alpha_1 = v_1 = 0.7$$

Proceeding as before, we update

$$v \leftarrow v - \alpha_1 \tilde r^1 = [0,0,0]$$

$$I \leftarrow I \setminus \{1\} = \{2\}$$

and finish this step by setting

$$\tilde r^2 = \sum_{j \in \{2\}} e_j = [0,1,0]$$

$$\alpha_2 = v_2 = 0$$

Finally,

$$v \leftarrow v - \alpha_2 \tilde r^2 = [0,0,0]$$

$$I \leftarrow I \setminus \{2\} = \{\}$$

We may now construct the selection set from the vectors given by (D.16):

$$r^1 = \tilde r^1 + \lfloor\rho\rfloor = [1,1,0] + [3,3,2] = [4,4,2]$$

$$r^2 = \tilde r^2 + \lfloor\rho\rfloor = [0,1,0] + [3,3,2] = [3,4,2]$$

$$r^3 = \tilde r^3 + \lfloor\rho\rfloor = [1,1,1] + [3,3,2] = [4,4,3]$$

$$r^0 = \tilde r^0 + \lfloor\rho\rfloor = [0,0,0] + [3,3,2] = [3,3,2]$$

i.e.,

$$\mathcal{S}(\rho) = \{[3,3,2], [3,4,2], [4,4,2], [4,4,3]\}$$


The example above illustrates the important difference between the selection set $\mathcal{N}_N(\rho)$
employed in [33] and the present construction: While all the elements of the set $\mathcal{N}_N(\rho)$ are
feasible, the elements of the selection set $\mathcal{S}(\rho)$ constructed above may be infeasible. The
following lemma considers resource allocation problems with total capacity constraints as a
special case of discrete stochastic optimization and asserts that there is always exactly one
feasible point in $\mathcal{S}(\rho)$.


Lemma 12 For problems (D.1) with a feasible set $A_d = \{\, r : \sum_{i=1}^{N} r_i = K,\ r \in \mathbb{Z}^N_+ \,\}$, the
selection set $\mathcal{S}(\rho)$ constructed above includes one and only one feasible point. Moreover,
this point is the argument of $\min_{r \in \mathcal{N}(\rho)} \|\rho - r\|$.

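Proof. Since $\rho \in A_c$, it follows from (D.7) that there exist $\{\alpha_i\}_{i=1}^{M}$ that satisfy

$$\rho_j = \sum_{i=1}^{M} \alpha_i r^i_j, \quad r^i \in \mathcal{N}(\rho), \quad \sum_{i=1}^{M} \alpha_i = 1, \quad \alpha_i \ge 0 \ \text{for}\ i = 1, \ldots, M$$

where $M = |\mathcal{N}(\rho)|$. Therefore,

$$\sum_{j=1}^{N} \rho_j = \sum_{j=1}^{N} \sum_{i=1}^{M} \alpha_i r^i_j = \sum_{i=1}^{M} \alpha_i \sum_{j=1}^{N} r^i_j = \sum_{i=1}^{M} \alpha_i K = K$$

Then, we can write

$$\sum_{j=1}^{N} \lfloor\rho_j\rfloor \le K \le \sum_{j=1}^{N} \lceil\rho_j\rceil$$

where the equality only holds for integer allocations. Since we agreed that $\rho$ does not have
integer components,

$$\sum_{j=1}^{N} \lfloor\rho_j\rfloor < K < \sum_{j=1}^{N} \lceil\rho_j\rceil$$

Note that

$$\sum_{i=1}^{N} \left( \lceil\rho_i\rceil - \lfloor\rho_i\rfloor \right) = N \tag{D.20}$$

and define the residual resource capacity $m$, with $0 < m < N$, as

$$m = K - \sum_{i=1}^{N} \lfloor\rho_i\rfloor \tag{D.21}$$

Also note that from (D.12) and (D.16)

$$r^0 = \lfloor\rho\rfloor + \tilde r^0 = \lfloor\rho\rfloor \in \mathcal{S}(\rho)$$

and from (D.8)

$$r^{o_1} = \lfloor\rho\rfloor + \tilde r^{o_1} = \lfloor\rho\rfloor + \sum_{k=1}^{N} e_{o_k} = \lceil\rho\rceil \in \mathcal{S}(\rho) \tag{D.22}$$

Now, observe that during the construction of $\mathcal{S}(\rho)$,

$$r^{o_l} - r^{o_{l+1}} = e_{o_l}$$

therefore,

$$\sum_{i=1}^{N} r^{o_l}_i - \sum_{i=1}^{N} r^{o_{l+1}}_i = 1 \tag{D.23}$$

Using (D.23),

$$\sum_{i=1}^{N} r^{o_1}_i - \sum_{i=1}^{N} r^{o_{N+1-m}}_i = N - m$$

and it follows from (D.22) that

$$\sum_{i=1}^{N} r^{o_{N+1-m}}_i = \sum_{i=1}^{N} \lceil\rho_i\rceil - N + m = K$$

where we have used (D.20) and (D.21). Therefore, $r^{o_{N+1-m}} \in \mathcal{N}(\rho)$ and the first part of
the proof is complete.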


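By construction of the selection set $\mathcal{S}(\rho)$, the elements $r^{o_l}$ satisfy the following

$$r^{o_l}_i = \lceil\rho_i\rceil \ \text{and}\ r^{o_l}_j = \lfloor\rho_j\rfloor \implies \tilde\rho_i \ge \tilde\rho_j \tag{D.24}$$

Therefore, we claim that $r^{o_{N+1-m}}$ is the solution of the minimization problem

$$\min_{r \in \mathcal{N}(\rho)} \|\rho - r\| = \min_{r \in \mathcal{N}(\rho)} \sqrt{\sum_{i=1}^{N} (\rho_i - r_i)^2}$$

for $\rho \in A_c$. All elements $r \in \mathcal{N}(\rho)$ can be characterized in terms of a set $\mathcal{M}_r$ of indices
defined as

$$\mathcal{M}_r = \{\, i \mid r_i = \lceil\rho_i\rceil \,\}$$

where $|\mathcal{M}_r| = m$ for $r \in \mathcal{N}(\rho)$. One can then write an equivalent minimization problem as

$$\min_{r \in \mathcal{N}(\rho)} \sum_{i=1}^{N} (\rho_i - r_i)^2 = \min_{r \in \mathcal{N}(\rho)} \Big[ \sum_{i:\, r_i = \lfloor\rho_i\rfloor} (\rho_i - r_i)^2 + \sum_{i:\, r_i = \lceil\rho_i\rceil} (\rho_i - r_i)^2 \Big] = \min_{r \in \mathcal{N}(\rho)} \Big[ \sum_{i \notin \mathcal{M}_r} \tilde\rho_i^2 + \sum_{i \in \mathcal{M}_r} (1 - \tilde\rho_i)^2 \Big]$$

For $r, r^{o_{N+1-m}} \in \mathcal{N}(\rho)$, the sets $\mathcal{M}_r$ and $\mathcal{M}_{r^{o_{N+1-m}}}$ are formed by $m$ elements from the
set $\{1, \ldots, N\}$. Starting at $\mathcal{M}_{r^{o_{N+1-m}}}$, one can reach any $\mathcal{M}_r$ by a series of iterations,
each iteration involving the removal of one element from the set $\mathcal{M}_{r^{o_{N+1-m}}} \setminus \mathcal{M}_r$ and the
addition of an element from the set $\mathcal{M}_r \setminus \mathcal{M}_{r^{o_{N+1-m}}}$. If we remove $i$ and add $j$ while $\tilde\rho_i \ge \tilde\rho_j$
we increase the objective value by

$$\tilde\rho_i^2 - \tilde\rho_j^2 + (1 - \tilde\rho_j)^2 - (1 - \tilde\rho_i)^2 = 2(\tilde\rho_i - \tilde\rho_j) \ge 0$$

Since $\mathcal{M}_{r^{o_{N+1-m}}}$ has the arguments for the $m$ highest $\tilde\rho_i$, we cannot decrease the distance
by moving to another $r \in \mathcal{N}(\rho)$; therefore $r^{o_{N+1-m}}$ minimizes the distance from $\rho$. ■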

Example (Cont.): For the resource allocation problem of $K = 10$ resources over $N = 3$
users, given $\rho = [3.9, 3.9, 2.2]$, we found a selection set

$$\mathcal{S}(\rho) = \{[3,3,2], [3,4,2], [4,4,2], [4,4,3]\}$$

Observe that $[4,4,2]$ is the only element of $\mathcal{S}(\rho)$ above which is feasible, consistent with
Lemma 12. In addition, note that $[4,4,2]$ is the obvious solution of the minimization
problem

$$\min_{r \in \mathcal{N}(\rho)} \|\rho - r\|$$

where $\mathcal{N}(\rho) = \{[3,4,3], [4,4,2], [4,3,3]\}$.
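Lemma 12 can be checked numerically on small instances. The sketch below is an illustration under stated assumptions, not code from the report: the function names are hypothetical, the selection set is built by sorting the residuals once (equivalent to the construction of Theorem 11, with ties broken by index), and the second test point is the initial state quoted in Example 1 of Section D.6.

```python
import math
from itertools import product


def selection_set(rho):
    """S(rho) from Theorem 11: r^0 = floor(rho), then r^{o_1}, ..., r^{o_N}."""
    n = len(rho)
    base = [math.floor(x) for x in rho]
    order = sorted(range(n), key=lambda j: rho[j] - base[j])   # o_1, ..., o_N (0-based)
    points = [tuple(base)]
    for l in range(n):
        ones = set(order[l:])                                  # components rounded up
        points.append(tuple(b + (j in ones) for j, b in enumerate(base)))
    return points


def closest_feasible_neighbor(rho, K):
    """Brute-force arg min of ||rho - r|| over cube vertices with sum(r) == K."""
    base = [math.floor(x) for x in rho]
    cube = (tuple(b + d for b, d in zip(base, delta))
            for delta in product((0, 1), repeat=len(rho)))
    return min((r for r in cube if sum(r) == K), key=lambda r: math.dist(rho, r))


def feasible_points_in_selection_set(rho, K):
    return [r for r in selection_set(rho) if sum(r) == K]


feasible_3 = feasible_points_in_selection_set([3.9, 3.9, 2.2], 10)
nearest_3 = closest_feasible_neighbor([3.9, 3.9, 2.2], 10)
feasible_4 = feasible_points_in_selection_set([1.8, 9.1, 6.2, 2.9], 20)
nearest_4 = closest_feasible_neighbor([1.8, 9.1, 6.2, 2.9], 20)
```

In both cases the selection set contains a single feasible point and it coincides with the brute-force nearest neighbor, as the lemma asserts.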


D.3  Construction of surrogate cost functions and their gradients


Since our approach is based on iterating over the continuous state $\rho \in A_c$, yet driving the
iteration process with information involving $L_d(r)$ obtained from a sample path under $r$,
we must establish a relationship between $L_d(r)$ and $L_c(\rho)$. The choice of $L_c(\rho)$ is rather
flexible and may depend on information pertaining to a specific model and the nature of
the given cost $L_d(r)$.

Before defining $L_c(\rho)$, we shall concentrate on surrogate cost functions $L_c(\rho, \mathcal{S}(\rho), \omega)$
(which clearly depend on a selection set and a sample path) that satisfy the following two
conditions:

(C1): Consistency: $L_c(r, \mathcal{S}(r), \omega) = L_d(r, \omega)$ for all $r \in \mathbb{Z}^N_+$.

(C2): Piecewise Linearity: $L_c(\rho, \mathcal{S}(\rho), \omega)$ is a linear function of $\rho$ over $\mathrm{conv}(\mathcal{S}(\rho))$.

In the sequel, the '$\mathcal{S}(\rho)$' term will be dropped along with '$\omega$' for simplicity.

Consistency is an obvious requirement for $L_c(\rho)$. Piecewise linearity is chosen for
convenience, since manipulating linear functions over $\mathrm{conv}(\mathcal{S}(\rho))$ simplifies analysis, as will
become clear in what follows.


Given some state $\rho \in A_c$ and cost functions $L_d(r^j)$ for all $r^j \in \mathcal{S}(\rho)$, it follows from
(C2) and (D.17) that we can write

$$\rho = \sum_{j=0}^{N} \alpha_j r^j \implies L_c(\rho) = \sum_{j=0}^{N} \alpha_j L_c(r^j)$$

with $\sum_{j=0}^{N} \alpha_j = 1$, $\alpha_j \ge 0$ for all $j = 0, \ldots, N$. Moreover, by (C1), we have

$$L_c(\rho) = \sum_{j=0}^{N} \alpha_j L_d(r^j) \tag{D.25}$$

that is, $L_c(\rho)$ is a convex combination of the costs of discrete neighbors in $\mathcal{S}(\rho)$. Note
that although $\mathcal{S}(\rho)$ is not unique, given $\mathcal{S}(\rho)$, the values of $\alpha_i$ for $i = 0, \ldots, N$ are unique;
therefore, $L_c(\rho)$ is well defined.
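To make (D.25) concrete, the sketch below evaluates the surrogate cost as the $\alpha$-weighted average of the discrete costs over one selection set. It is illustrative only: the function names are assumptions, the selection set is the one produced by sorting the residuals (ties broken by index), and the quadratic cost with target $[2,5,3]$ is borrowed from the continuation of the example in Section D.3.1.

```python
import math


def selection_set_with_weights(rho):
    """Return (points, alphas) for the sorted-residual selection set of Theorem 11."""
    n = len(rho)
    base = [math.floor(x) for x in rho]
    resid = [x - b for x, b in zip(rho, base)]
    order = sorted(range(n), key=lambda j: resid[j])        # o_1, ..., o_N (0-based)
    points = [tuple(base)]                                  # r^0
    alphas = [1.0 - resid[order[-1]]]                       # alpha_0 = 1 - largest residual
    previous = 0.0
    for l, j in enumerate(order):
        ones = set(order[l:])
        points.append(tuple(b + (i in ones) for i, b in enumerate(base)))
        alphas.append(resid[j] - previous)                  # alpha_{o_l}, see (D.9)
        previous = resid[j]
    return points, alphas


def surrogate_cost(rho, cost):
    """L_c(rho) as the convex combination (D.25) of discrete costs."""
    points, alphas = selection_set_with_weights(rho)
    return sum(a * cost(r) for a, r in zip(alphas, points))


def quadratic_cost(r, target=(2, 5, 3)):
    """Example discrete cost ||r - target||^2 (taken from the report's example)."""
    return sum((ri - ti) ** 2 for ri, ti in zip(r, target))


value = surrogate_cost([3.9, 3.9, 2.2], quadratic_cost)
```

With weights $0.1, 0.2, 0.7, 0$ on $[3,3,2]$, $[4,4,3]$, $[4,4,2]$, $[3,4,2]$ and costs $6, 5, 6, 3$, the surrogate cost at $\rho = [3.9, 3.9, 2.2]$ is $0.6 + 1.0 + 4.2 + 0 = 5.8$.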


We now define a surrogate cost function $\bar L_c(\rho)$ and the corresponding selection set $\mathcal{S}^*(\rho)$
as

$$\bar L_c(\rho) = \min_{\mathcal{S}(\rho)} L_c(\rho) \tag{D.26}$$

and

$$\mathcal{S}^*(\rho) = \arg\min_{\mathcal{S}(\rho)} L_c(\rho) \tag{D.27}$$

where the minimization is over all possible selection sets for the point $\rho$. The function $\bar L_c(\rho)$
satisfies the consistency condition (C1), but it may not be a continuous function due to
changes in the selection set $\mathcal{S}^*(\cdot)$ for neighboring points.


Next, if we are to successfully use the iterative scheme described by (D.4)-(D.5), we need
information of the form $H_n(\rho_n, r_n, \omega_n)$ following the $n$th step of the on-line optimization
process. Typically, this information is contained in an estimate of the gradient. Our next
objective, therefore, is to seek the selection-set-dependent sample gradient $\nabla L_c(\rho)$ expressed
in terms of directly observable sample path data.


D.3.1  Gradient  evaluation 

The gradient information necessary to drive the stochastic approximation part of the
surrogate method is evaluated depending on the form of the cost function. Gradient estimation
for separable cost functions is significantly simpler and is discussed in Section D.3.3.
Since $L_c(\rho)$ is a linear function on the convex hull defined by $\mathcal{S}(\rho)$, one can write

$$L_c(\rho) = \sum_{i=1}^{N} \beta_i \rho_i + \beta_0 \tag{D.28}$$

where

$$\beta_i = \frac{\partial L_c(\rho)}{\partial \rho_i}, \quad i = 1, \ldots, N$$

and $\beta_0 \in \mathbb{R}$ is a constant. Note that the $\beta_i$ values depend on the selection set $\mathcal{S}(\rho)$, which,
as already pointed out, may not be unique.

For any $r^j \in \mathcal{S}(\rho)$, one can use (D.28) and (C1) to write

$$L_d(r^j) = \sum_{i=1}^{N} \beta_i r^j_i + \beta_0$$

Note that the values $L_d(r^j)$ are either observable or can be evaluated using Concurrent
Estimation or Perturbation Analysis techniques (see [41], [20]) despite the fact that $r^j \in
\mathcal{S}(\rho)$ may be infeasible, i.e., having infeasible points in the selection set does not affect our
ability to obtain gradients. One can now rewrite the equation above as

$$[\,e \;\; R\,]\, \beta = L$$

where $e$ is an $(N+1)$-dimensional column vector with all entries equal to 1, $\beta = [\beta_0, \ldots, \beta_N]'$,
$R$ is the matrix whose rows are $r^j \in \mathcal{S}(\rho)$, and $L$ is the column vector of costs for these
discrete states. Since we have constructed $\mathcal{S}(\rho)$ so that $[\,e \;\; R\,]$ is non-singular, the gradient
given by $\nabla L_c(\rho) = [\beta_1, \ldots, \beta_N]'$ can be obtained from the last equation as

$$\nabla L_c(\rho) = [\,\mathbf{0} \;\; I\,]\, [\,e \;\; R\,]^{-1} L \tag{D.29}$$

where $I$ is the identity matrix of dimension $N$ and $\mathbf{0}$ is the $N$-dimensional vector of zeros.
Substituting from equation (D.18), the gradient can be written as

$$\nabla L_c(\rho) = \begin{bmatrix} -e'_{N+1-\bar o_1} + e'_{N+2-\bar o_1} \\ \vdots \\ -e'_{N+1-\bar o_N} + e'_{N+2-\bar o_N} \end{bmatrix} \begin{bmatrix} L_d(r^0) \\ L_d(r^{o_N}) \\ \vdots \\ L_d(r^{o_1}) \end{bmatrix} \tag{D.30}$$

Therefore,

$$\nabla_j L_c(\rho) = L_d(r^j) - L_d(r^k) \tag{D.31}$$

where $k$ satisfies $\bar o_j + 1 = \bar o_k$, i.e., $r^j - r^k = e_j$ (with $r^k = r^0$ when $\bar o_j = N$). As pointed
out in the Introduction, this is a significant simplification over the gradient evaluation used
in our earlier work [33]. Moreover, this analysis allows us to combine the algorithm for
determining the selection set given in the previous section with the gradient estimation
component of our approach to obtain the following:
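• Initialize the index set $I = \{1, \ldots, N\}$

• $r^i = \lceil\rho\rceil$ where $i = \arg\min_{j \in I} \tilde\rho_j$

• $OC = L_d(r^i)$

• $o_1 = i$

• $I = I \setminus \{i\}$

• While $I \ne \emptyset$ (for $l = 1, 2, \ldots, N-1$):

1. $r^k = r^{o_l} - e_{o_l}$ where $k = \arg\min_{j \in I} \tilde\rho_j$

2. $\nabla_{o_l} L_c(\rho) = OC - L_d(r^k)$

3. $OC = L_d(r^k)$

4. $o_{l+1} = k$

5. $I = I \setminus \{k\}$

• $r^0 = \lfloor\rho\rfloor$

• $\nabla_{o_N} L_c(\rho) = OC - L_d(r^0)$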
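The combined procedure walks from $\lceil\rho\rceil$ down to $\lfloor\rho\rfloor$ one coordinate at a time, taking successive cost differences as in (D.31). The following sketch shows one way to code it; the function names are illustrative assumptions, ties in the residuals are broken by index, and the quadratic cost with target $[2,5,3]$ is the one used in the example that follows. The `cost` callback stands in for whatever mechanism (observation, Concurrent Estimation, Perturbation Analysis) supplies $L_d(r^j)$.

```python
import math


def surrogate_gradient(rho, cost):
    """Gradient of L_c at rho via the successive differences of (D.31).

    Uses N + 1 evaluations of the discrete cost, one per element of S(rho).
    """
    n = len(rho)
    base = [math.floor(x) for x in rho]
    order = sorted(range(n), key=lambda j: rho[j] - base[j])   # o_1, ..., o_N (0-based)
    point = [b + 1 for b in base]            # r^{o_1} = ceil(rho) for non-integer rho
    old_cost = cost(tuple(point))            # OC
    grad = [0.0] * n
    for j in order:
        point[j] -= 1                        # r^{o_l} - e_{o_l}: next element of S(rho)
        new_cost = cost(tuple(point))
        grad[j] = old_cost - new_cost        # (D.31)
        old_cost = new_cost
    return grad                              # point now equals r^0 = floor(rho)


def quadratic_cost(r, target=(2, 5, 3)):
    """Example discrete cost ||r - target||^2 (from the report's example)."""
    return sum((ri - ti) ** 2 for ri, ti in zip(r, target))


rho = [3.9, 3.9, 2.2]
grad = surrogate_gradient(rho, quadratic_cost)
stepped = [x - 0.5 * g for x, g in zip(rho, grad)]   # unprojected update with step 0.5
```

No matrix inversion is needed: the structure of $[\,e \;\; R\,]^{-1}$ in (D.18) reduces (D.29) to these $N$ differences.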


Example (Cont.): For the example we have been using throughout the paper, consider
the cost function

$$J_d(r) = \|r - [2,5,3]\|^2$$

Let $\rho_n = [3.9, 3.9, 2.2]$, for which we found $\mathcal{S}(\rho) = \{[3,3,2], [3,4,2], [4,4,2], [4,4,3]\}$ with
$r^1 = [4,4,2]$, $r^2 = [3,4,2]$, $r^3 = [4,4,3]$, and $r^0 = [3,3,2]$. The gradient at this point can,
therefore, be evaluated using (D.31) to give

$$\nabla L_c(\rho_n) = \begin{bmatrix} J_d([4,4,2]) - J_d([3,4,2]) \\ J_d([3,4,2]) - J_d([3,3,2]) \\ J_d([4,4,3]) - J_d([4,4,2]) \end{bmatrix} = \begin{bmatrix} 3 \\ -3 \\ -1 \end{bmatrix}$$

Using $\eta_n = 0.5$ in (D.4):

$$\rho_{n+1} = \pi_{n+1}\left[ \rho_n - \eta_n \nabla L_c(\rho_n) \right] = \pi_{n+1}\left[ [2.4, 5.4, 2.7] \right] = [2.2, 5.2, 2.6]$$

where we have used the projection $\pi$ to map the point $[2.4, 5.4, 2.7]$ into a feasible point
$[2.2, 5.2, 2.6] \in A_c$. For this example, $\pi$ can be defined as follows:

$$\pi[\tilde\rho] = \arg\min_{\sum_{i=1}^{3} \rho_i = 10} \|\rho - \tilde\rho\|$$
D.3.2  Projection Mapping


The projection mapping $\pi$ is a crucial element of our method and has a very significant
effect on the convergence. In this section, we discuss a projection mapping which can be
used for resource allocation problems with feasible sets

$$A_d = \Big\{\, r : \sum_{i=1}^{N} r_i = K,\ r \in \mathbb{Z}^N_+ \,\Big\}$$

Consider the optimization problem

$$\min_{\rho} \sum_{i=1}^{N} (\rho_i - \tilde\rho_i)^2$$

subject to

$$\rho_i \ge 0, \qquad \sum_{i=1}^{N} \rho_i = K$$

The solution to this optimization problem, which we will denote by $\pi(\tilde\rho)$, is the closest point
in the feasible set $A_c$ to the point $\tilde\rho$. Note that a projection $\pi$ onto a closed convex set defined
in this manner is continuous and nonexpansive, therefore it guarantees convergence (see
[33]).

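Let us consider the relaxed problem

$$\min_{\rho_i \ge 0} \sum_{i=1}^{N} \left[ (\rho_i - \tilde\rho_i)^2 - \lambda \rho_i \right] + \lambda K$$

The necessary optimality conditions are

$$2(\rho_i - \tilde\rho_i) - \lambda = 0 \quad \text{for}\ \rho_i > 0$$

$$2(\rho_i - \tilde\rho_i) - \lambda \ge 0 \quad \text{for}\ \rho_i = 0$$

$$\sum_{i=1}^{N} \rho_i = K$$

or equivalently

$$\rho_i = \tilde\rho_i + \frac{\lambda}{2} \quad \text{for}\ \rho_i > 0$$

$$\rho_i \ge \tilde\rho_i + \frac{\lambda}{2} \quad \text{for}\ \rho_i = 0$$

$$\sum_{i=1}^{N} \rho_i = K$$

i.e.,

$$\rho_i = \max\left(0,\ \tilde\rho_i + \frac{\lambda}{2}\right), \qquad \sum_{i=1}^{N} \rho_i = K$$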

These equations suggest the following algorithm for the $\pi$ projection:

Projection Algorithm:

• Initialize $\lambda_0 = \frac{2}{N}\left(K - \sum_{i=1}^{N} \max(0, \tilde\rho_i)\right)$

• While some stopping condition is not satisfied:

1. For $i = 1, 2, \ldots, N$, $\rho_i = \max(0, \tilde\rho_i + \frac{\lambda}{2})$

2. $\lambda \leftarrow \lambda + \frac{2}{N}\left(K - \sum_{i=1}^{N} \rho_i\right)$

• Set $\pi[\tilde\rho] = \rho$.

A common stopping condition we have used in our work (see also Section D.6) is
$\left| K - \sum_{i=1}^{N} \rho_i \right| < \epsilon$, for some small $\epsilon > 0$. Then, the vector $\rho$ is rescaled,

$$\rho \leftarrow \rho\, \frac{K}{\sum_{i=1}^{N} \rho_i}$$

to satisfy the capacity constraint. The error introduced while rescaling is small and it is a
monotonically increasing function of $\epsilon$. Note that there is a trade-off between the number
of iterations needed and the size of the resulting error term determined by the selection of
$\epsilon$.
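The iteration on the multiplier $\lambda$ can be sketched as follows. This is an illustration under explicit assumptions: the function name, the default tolerance, the iteration cap, and in particular the step factor $2/N$ in the $\lambda$ update are choices made here (the factor is the one that solves the problem in a single step when no component is clipped at zero), and the test point is hypothetical. It also assumes $K > 0$.

```python
def project(point, K, eps=1e-9, max_iter=10_000):
    """Approximate Euclidean projection of `point` onto {rho >= 0, sum(rho) = K}.

    Iterates rho_i = max(0, point_i + lam / 2), adjusting lam until the
    capacity constraint holds within eps, then rescales to meet it exactly.
    The 2/N step factor is an assumption of this sketch.
    """
    n = len(point)
    lam = 2.0 / n * (K - sum(max(0.0, x) for x in point))
    for _ in range(max_iter):
        rho = [max(0.0, x + lam / 2.0) for x in point]
        gap = K - sum(rho)
        if abs(gap) < eps:                   # stopping condition |K - sum(rho)| < eps
            break
        lam += 2.0 / n * gap
    total = sum(rho)
    return [x * K / total for x in rho]      # final rescaling step


# Hypothetical input with one negative component that ends up clipped at zero.
projected = project([6.0, 5.0, -0.5], K=10)
```

For this input the third component is clipped to zero and the excess of 1 is removed equally from the two active components, giving approximately $[5.5, 4.5, 0]$. Smaller values of `eps` reduce the rescaling error at the cost of more iterations.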


D.3.3  Separable  cost  functions 

Suppose that the cost function, $L_d(\cdot)$, is separable in the sense that it is a sum of component
costs each dependent on its local state only, i.e., let

$$L_d(r) = \sum_{i=1}^{N} L_{d,i}(r_i) \tag{D.32}$$

In this case, our approach is significantly simplified. In particular, from (D.31) and (D.32),
we can write

$$\nabla_j L_c(\rho) = L_d(r^j) - L_d(r^k) = \sum_{i=1}^{N} L_{d,i}(r^j_i) - \sum_{i=1}^{N} L_{d,i}(r^k_i) = \sum_{i=1}^{N} \left[ L_{d,i}(r^j_i) - L_{d,i}(r^k_i) \right] = L_{d,j}(r^j_j) - L_{d,j}(r^j_j - 1)$$

where $k$ satisfies $\bar o_j + 1 = \bar o_k$, i.e., $r^j - r^k = e_j$. Note that $r^j_j = \lceil\rho_j\rceil$ and $r^j_j - 1 = \lfloor\rho_j\rfloor$;
therefore,

$$\nabla_j L_c(\rho) = L_{d,j}(\lceil\rho_j\rceil) - L_{d,j}(\lfloor\rho_j\rfloor) \tag{D.33}$$

This result indicates that for separable cost functions estimating sensitivities does not
require the determination of a selection set; we can instead simply pick a feasible neighbor
(preferably the closest feasible neighbor to $\rho$) and apply Perturbation Analysis (PA)
techniques to determine the gradient components through (D.33). There are a number of PA
techniques developed precisely for evaluating the effect of decreasing and increasing the
number of resources allocated to user $i$; for example, estimating the sensitivity of packet
loss in a radio network with respect to adding/removing a transmission slot available to the
$i$th user [19], [82]. In [34] a PA technique is used together with our methodology to solve
a call admission problem (with a separable cost function) over a communication network
where there are capacity constraints on each node, while there is no total capacity constraint
for the network.
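For a separable cost, (D.33) needs only the component costs at $\lceil\rho_j\rceil$ and $\lfloor\rho_j\rfloor$. The sketch below illustrates this; the function name and the list-of-callables representation of the component costs are assumptions, and the quadratic components with targets $[2,5,3]$ are taken from the example of Section D.3.1, whose cost happens to be separable. Exact component costs are used here in place of Perturbation Analysis estimates.

```python
import math


def separable_gradient(rho, component_costs):
    """Gradient via (D.33): L_{d,j}(ceil(rho_j)) - L_{d,j}(floor(rho_j)).

    Assumes non-integer rho_j, so that ceil(rho_j) = floor(rho_j) + 1.
    """
    return [f(math.floor(x) + 1) - f(math.floor(x))
            for x, f in zip(rho, component_costs)]


# ||r - [2, 5, 3]||^2 written as a sum of per-user costs (r_j - t_j)^2.
targets = (2, 5, 3)
component_costs = [lambda x, t=t: (x - t) ** 2 for t in targets]
grad = separable_gradient([3.9, 3.9, 2.2], component_costs)
```

The result agrees with the gradient $[3, -3, -1]$ obtained from the selection set in Section D.3.1, which is consistent with the remark that all selection sets yield the same sensitivity when the cost is separable.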


D.4  Recovery  of  optimal  discrete  states 

Our ultimate goal remains the solution of (D.1), that is, the determination of $r^* \in A_d$ that
solves this optimization problem. Our approach is to solve (D.2) by iterating on $\rho \in A_c$ and,
at each step, transforming $\rho$ through some $f \in \mathcal{F}_\rho$. The connection between $\rho$ and $r = f(\rho)$
for each step is therefore crucial, as is the relationship between $\rho^*$ and $f(\rho^*)$ when and if
this iterative process comes to an end identifying a solution $\rho^*$ to the surrogate problem
(D.2).

The following theorem identifies a key property of the selection set $\mathcal{S}^*(\rho^*)$ of an optimal
surrogate state $\rho^*$.

Theorem 13 Let $\rho^*$ minimize $L_c(\rho)$ over $A_c$. If $r^* = \arg\min_{r \in \mathcal{S}^*(\rho^*)} L_d(r) \in \mathcal{N}(\rho^*)$, i.e.,
the minimal cost element $r^*$ of the selection set $\mathcal{S}^*(\rho^*)$ corresponding to $L_c(\rho^*)$ is feasible,
then $r^*$ minimizes $L_d(r)$ over $A_d$ and satisfies $L_d(r^*) = L_c(\rho^*)$.
Proof. By (D.25), the optimal surrogate state $\rho^* = \arg\min_{\rho \in A_c} L_c(\rho)$ satisfies

$$L_c(\rho^*) = \sum_{j=0}^{N} \alpha_j L_d(r^j)$$

where $\sum_{j=0}^{N} \alpha_j = 1$, $r^j \in \mathcal{S}^*(\rho^*)$, $\alpha_j \ge 0$ for $j = 0, \ldots, N$. Then,

$$L_c(\rho^*) = \sum_{j=0}^{N} \alpha_j L_d(r^j) \ge \sum_{j=0}^{N} \alpha_j L_d(r^*) = L_d(r^*) \tag{D.34}$$

regardless of the feasibility of $r^*$.

Next, note that $A_d \subset A_c$ and $L_c(r) = L_d(r)$ for any $r \in A_d$. Therefore, if $r^* \in \mathcal{N}(\rho^*)$,
then

$$L_d(r^*) = L_c(r^*) \ge L_c(\rho^*) \tag{D.35}$$

In view of (D.34) and (D.35), we then get

$$L_d(r^*) \le L_c(\rho^*) \le L_d(r^*)$$

It follows that

$$L_d(r^*) = L_c(\rho^*)$$

Since every $r \in A_d$ satisfies $L_d(r) = L_c(r) \ge L_c(\rho^*) = L_d(r^*)$, $r^*$ is optimal over $A_d$. Finally,
since $r^*$ is one of the discrete feasible neighboring states of $\rho^*$, i.e., $r^* \in \mathcal{N}(\rho^*)$, we have
$r^* = f(\rho^*)$ for some $f \in \mathcal{F}_{\rho^*}$. ■

Corollary 14 For unconstrained problems, let $\rho^*$ minimize $L_c(\rho)$. Then,

$$r^* = \arg\min_{r \in \mathcal{S}^*(\rho^*)} L_d(r)$$

minimizes $L_d(r)$ and satisfies $L_d(r^*) = L_c(\rho^*)$.

Proof. If problem (D.1) is unconstrained, then trivially $r^* = \arg\min_{r \in \mathcal{S}^*(\rho^*)} L_d(r) \in \mathcal{N}(\rho^*)$
and the result follows. ■

An interesting example of an unconstrained problem is that of lot sizing in manufacturing
systems (see [22]) where the sizes of lots of different parts being produced may take any
(non-negative) integer value. Clearly, Corollary 14 also holds for problems where the optimal
point is in the interior of the feasible set where the constraints are not active, i.e., $\mathcal{N}(\rho^*) =
C(\rho^*)$.

If there are active constraints around the optimal point $\rho^*$, i.e., $\mathcal{N}(\rho^*) \subset C(\rho^*)$, and there
are infeasible points in the selection set $\mathcal{S}(\rho^*)$, then, if one of these infeasible points has
the minimal cost, the recovery of the optimum as a feasible neighbor of $\rho^*$ becomes difficult
to guarantee theoretically, even though empirical evidence shows that this is indeed the
case. This is the price to pay for the generalization of the surrogate problem method we
have presented here through the introduction of a selection set that allows the inclusion
of infeasible points. However, if the cost function $L_d(r)$ is "smooth" in some sense, the
minimal cost element of $\mathcal{N}(\rho^*)$ will in general be either an optimal or a near-optimal point
as stated in the next lemma.
Lemma 15 If the cost function $L_d(r)$ satisfies

$$|L_d(r^1) - L_d(r^2)| \le c_\omega \|r^1 - r^2\|, \quad c_\omega < \infty \tag{D.36}$$

then all $r \in \mathcal{N}(\rho^*)$ satisfy

$$L_d(r) \le L_c(\rho^*) + c_\omega \sqrt{N} \tag{D.37}$$

Proof. Note that $\mathcal{S}^*(\rho^*) \subset C(\rho^*)$ and $\mathcal{N}(\rho^*) \subset C(\rho^*)$. It is easy to show that for
$r^1, r^2 \in C(\rho^*)$

$$\|r^1 - r^2\| \le \sqrt{N}$$

By (D.34), there exists $r^* \in \mathcal{S}^*(\rho^*)$ that satisfies $L_c(\rho^*) \ge L_d(r^*)$ regardless of its feasibility.
For $r \in \mathcal{N}(\rho^*)$, we can write $L_c(\rho^*) \le L_d(r)$, therefore

$$|L_d(r) - L_d(r^*)| = L_d(r) - L_d(r^*) \ge L_d(r) - L_c(\rho^*) \tag{D.38}$$

By assumption (D.36),

$$|L_d(r) - L_d(r^*)| \le c_\omega \|r - r^*\| \le c_\omega \sqrt{N} \tag{D.39}$$

Hence, from (D.38) and (D.39),

$$L_d(r) \le L_c(\rho^*) + c_\omega \sqrt{N}$$

■

In practice, for many cost metrics such as throughput or mean system time in queueing
models, it is common to have the costs in the neighborhood of an optimal point be relatively
close, in which case the value of $c_\omega$ is small and (D.37) is a useful bound.


D.5  Optimization  Algorithm 


Summarizing the results of the previous sections and combining them with the basic scheme
described by (D.4)-(D.5), we obtain the following optimization algorithm for the solution
of the basic problem in (D.1):

• Initialize $\rho_0 = r_0$ and perturb $\rho_0$ to have all components non-integer.

• For any iteration $n = 0, 1, \ldots$:

1. Determine $\mathcal{S}(\rho_n)$ [using the construction of Theorem 11; recall that this set is
generally not unique].

2. Select $f_n \in \mathcal{F}_{\rho_n}$ such that $r_n = \arg\min_{r \in \mathcal{N}(\rho_n)} \|r - \rho_n\| = f_n(\rho_n) \in \mathcal{N}(\rho_n)$.

3. Operate at $r_n$ to collect $L_d(r^i)$ for all $r^i \in \mathcal{S}(\rho_n)$ [using Concurrent Estimation or
some form of Perturbation Analysis; or, if feasible, through off-line simulation].

4. Evaluate $\nabla L_c(\rho_n)$ [using (D.31)].

5. Update the continuous state: $\rho_{n+1} = \pi_{n+1}[\rho_n - \eta_n \nabla L_c(\rho_n)]$.

6. If some stopping condition is not satisfied, repeat steps for $n+1$. Else, set
$\rho^* = \rho_{n+1}$.

• Obtain the optimal (or the near optimal) state as one of the neighboring feasible states
in the set $\mathcal{N}(\rho^*)$.

Note that for separable cost functions, steps 1-6 can be replaced by

1. Select $f_n$ such that $r_n = \arg\min_{r \in \mathcal{N}(\rho_n)} \|r - \rho_n\| = f_n(\rho_n) \in \mathcal{N}(\rho_n)$.

2. Operate at $r_n$ to evaluate $\nabla L_c(\rho_n)$ using Perturbation Analysis and (D.33).

3. Update the continuous state: $\rho_{n+1} = \pi_{n+1}[\rho_n - \eta_n \nabla L_c(\rho_n)]$.

4. If some stopping condition is not satisfied, repeat steps for $n+1$. Else, set $\rho^* = \rho_{n+1}$.

The surrogate part of this algorithm is a stochastic approximation scheme with
projection whose convergence was analyzed in [33] and references therein.
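The separable variant of the algorithm can be sketched end to end for a deterministic cost. Everything below is an illustration under stated assumptions rather than the authors' implementation: the function names, the constant step size of 0.5 (the value used in the three-user example of Section D.3.1), the fixed iteration count used as a stopping rule, and the $2/N$ factor in the projection are all choices made here. The data ($K = 20$, $N = 4$, target $[4,5,3,8]$, initial state $[1.8, 9.1, 6.2, 2.9]$) are those of Example 1 in Section D.6, and exact cost differences stand in for Perturbation Analysis estimates.

```python
import math


def project(point, K, eps=1e-9, max_iter=10_000):
    """Projection onto {rho >= 0, sum(rho) = K}; the 2/N step factor is an assumption."""
    n = len(point)
    lam = 2.0 / n * (K - sum(max(0.0, x) for x in point))
    for _ in range(max_iter):
        rho = [max(0.0, x + lam / 2.0) for x in point]
        gap = K - sum(rho)
        if abs(gap) < eps:
            break
        lam += 2.0 / n * gap
    total = sum(rho)
    return [x * K / total for x in rho]


def closest_feasible_neighbor(rho, K):
    """Round up the m = K - sum(floor(rho)) components with the largest residuals."""
    base = [math.floor(x) for x in rho]
    m = K - sum(base)
    by_residual = sorted(range(len(rho)), key=lambda j: rho[j] - base[j], reverse=True)
    for j in by_residual[:m]:
        base[j] += 1
    return base


def separable_gradient(rho, targets):
    """(D.33) for component costs (r_j - t_j)^2, with ceil taken as floor + 1."""
    return [(math.floor(x) + 1 - t) ** 2 - (math.floor(x) - t) ** 2
            for x, t in zip(rho, targets)]


def surrogate_search(rho0, K, targets, step=0.5, iterations=10):
    """Iterate rho <- pi[rho - step * grad], then recover a discrete state."""
    rho = list(rho0)
    for _ in range(iterations):
        grad = separable_gradient(rho, targets)
        rho = project([x - step * g for x, g in zip(rho, grad)], K)
    return rho, closest_feasible_neighbor(rho, K)


rho_final, r_final = surrogate_search([1.8, 9.1, 6.2, 2.9], K=20, targets=(4, 5, 3, 8))
```

On this instance the recovered discrete state is $[4, 5, 3, 8]$, which is the unconstrained minimizer of the cost and satisfies the capacity constraint. With a constant step the continuous iterate itself need not settle; a diminishing step-size sequence is the usual choice in stochastic approximation.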

Note that ideally we would like to have $\nabla J_c(\rho_n)$ be the cost sensitivity driving the
algorithm. Since this information is not always available in a stochastic environment and
since $J_c(\rho_n) = E[\bar L_c(\rho_n, \omega)]$, the stochastic approximation algorithm uses $\nabla \bar L_c(\rho_n, \omega)$ as an
estimate and under some standard assumptions on the estimation error $\epsilon_n$ where

$$\nabla J_c(\rho_n) = \nabla \bar L_c(\rho_n, \omega) + \epsilon_n$$

the convergence is guaranteed. In order to get $\nabla \bar L_c(\rho_n, \omega)$, however, one needs to consider
all possible selection sets. In this algorithm we utilize only one of those selection sets
and approximate $\nabla \bar L_c(\rho_n, \omega)$ with $\nabla L_c(\rho_n, \mathcal{S}(\rho_n), \omega)$. This approximation introduces yet
another error term $\bar\epsilon_n$ where

$$\nabla \bar L_c(\rho_n, \omega) = \nabla L_c(\rho_n, \mathcal{S}(\rho_n), \omega) + \bar\epsilon_n$$

Note that this error term $\bar\epsilon_n$ exists regardless of stochasticity, unless the cost function $L_d(\cdot)$
is separable (all selection sets will yield the same sensitivity for separable cost functions).
We can combine error terms to define $\tilde\epsilon_n = \epsilon_n + \bar\epsilon_n$ and write

$$\nabla J_c(\rho_n) = \nabla L_c(\rho_n, \mathcal{S}(\rho_n), \omega) + \tilde\epsilon_n$$

If the augmented error term $\tilde\epsilon_n$ satisfies the standard assumptions, then convergence of the
algorithm to the optimal follows in the same way as presented in [33].
D.6  Numerical  Examples  and  Applications


We first illustrate our approach by means of a simple deterministic example, followed by a more challenging stochastic optimization application for a classical problem in manufacturing systems.

Example 1: Consider an allocation problem of $K = 20$ resources over $N = 4$ users so as to minimize the convex cost function $J_d(r)$ defined as

$$J_d(r) = \|r - [4, 5, 3, 8]\|^2$$

Suppose the initial state is $\rho_0 = [1.8, 9.1, 6.2, 2.9]$. Note that the set of feasible neighboring states $\mathcal{N}(\rho_0)$ is

$$\mathcal{N}(\rho_0) = \{[2, 10, 6, 2],\ [2, 9, 7, 2],\ [2, 9, 6, 3],\ [1, 10, 7, 2],\ [1, 10, 6, 3],\ [1, 9, 7, 3]\}$$
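The six states listed above can be reproduced by enumeration. The sketch below assumes that the feasible neighbors of a surrogate state are the integer vectors obtained by rounding each component down or up whose components sum to $K$; the helper names `feasible_neighbors` and `nearest` are hypothetical.

```python
from itertools import product
from math import ceil, floor


def feasible_neighbors(rho, K):
    """Integer vectors built from floor/ceil of each component that sum to K."""
    choices = [sorted({floor(x), ceil(x)}) for x in rho]
    return [list(r) for r in product(*choices) if sum(r) == K]


def nearest(rho, candidates):
    """Candidate closest to rho in Euclidean distance."""
    return min(candidates, key=lambda r: sum((ri - x) ** 2 for ri, x in zip(r, rho)))


if __name__ == "__main__":
    rho0 = [1.8, 9.1, 6.2, 2.9]
    nbrs = feasible_neighbors(rho0, 20)
    print(len(nbrs), nbrs)
    print(nearest(rho0, nbrs))
```

For the initial state of this example the enumeration yields six vectors, and the closest one is [2, 9, 6, 3].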


Following the steps shown in the algorithm of Section D.5, we have:

1. Determine the selection set $S(\rho_0)$:

$$S(\rho_0) = \{[1, 9, 6, 2],\ [1, 9, 6, 3],\ [2, 9, 6, 3],\ [2, 9, 7, 3],\ [2, 10, 7, 3]\}$$

2. Select $r_0 = f_0(\rho_0) \in \mathcal{N}(\rho_0)$:

$$r_0 = [2, 9, 6, 3]$$

3. Evaluate cost functions for states in $S(\rho_0)$:

$$J_d([1,9,6,2]) = 70, \quad J_d([1,9,6,3]) = 59, \quad J_d([2,9,6,3]) = 54, \quad J_d([2,9,7,3]) = 61, \quad J_d([2,10,7,3]) = 70$$

4. Evaluate the gradient of the cost at $\rho_0$:

$$(\nabla J_c(\rho_0))_1 = J_d([2,9,6,3]) - J_d([1,9,6,3]) = -5$$
$$(\nabla J_c(\rho_0))_2 = J_d([2,10,7,3]) - J_d([2,9,7,3]) = 9$$
$$(\nabla J_c(\rho_0))_3 = J_d([2,9,7,3]) - J_d([2,9,6,3]) = 7$$
$$(\nabla J_c(\rho_0))_4 = J_d([1,9,6,3]) - J_d([1,9,6,2]) = -11$$

Therefore,

$$\nabla J_c(\rho_0) = [-5,\ 9,\ 7,\ -11]^T$$

5. Update the surrogate state:

$$\rho_1 = \pi_1[\rho_0 - \eta_0 \nabla J_c(\rho_0)]$$

6. If the stopping condition is not satisfied, go to step 1 and repeat with $\rho_{n+1}$ replacing $\rho_n$ for $n = 0, 1, \ldots$.

Using a step size sequence $\eta_n = 0.5/(n + 1)$, the following table shows the evolution of the algorithm for the first few steps. Note that the optimal allocation [4, 5, 3, 8] is reached after a single step.


| STEP | $\rho$ | $r$ | $J(\rho)$ | $J(r)$ |
|---|---|---|---|---|
| 0 | [1.800, 9.100, 6.200, 2.900] | [2, 9, 6, 3] | 56.84 | 54 |
| 1 | [4.300, 4.600, 2.700, 8.400] | [4, 5, 3, 8] | 0.50 | 0 |
| 2 | [4.050, 4.850, 2.950, 8.150] | [4, 5, 3, 8] | 0.05 | 0 |

Table D.1: Optimal Resource Allocation
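The $\rho$, $r$, and $J(r)$ columns of the table can be checked with a short script. This is a sketch under stated assumptions rather than the implementation used for the report: it exploits the separability of this particular cost to compute each gradient component as a unit difference in one coordinate (which reproduces the differences over the selection set in step 4), and it uses a plain projection onto the hyperplane $\sum_i \rho_i = K$, ignoring non-negativity. All function names are illustrative.

```python
from math import floor

TARGET = [4, 5, 3, 8]  # minimizer of the cost in Example 1
K = 20                 # total number of resources


def cost(r):
    """J_d(r) = ||r - TARGET||^2."""
    return sum((ri - ti) ** 2 for ri, ti in zip(r, TARGET))


def nearest_feasible(rho):
    """Round down, then round up the components with the largest fractional
    parts until the allocation sums to K."""
    base = [floor(x) for x in rho]
    k = K - sum(base)
    order = sorted(range(len(rho)), key=lambda i: rho[i] - base[i], reverse=True)
    r = base[:]
    for i in order[:k]:
        r[i] += 1
    return r


def gradient(rho):
    """Unit difference of the cost in each coordinate between floor and floor + 1.
    Assumption: valid here because the cost is separable."""
    g = []
    for i, x in enumerate(rho):
        lo = floor(x)
        g.append((lo + 1 - TARGET[i]) ** 2 - (lo - TARGET[i]) ** 2)
    return g


def run(rho, steps):
    """Return (step, rho, r, J(r)) for each step, with eta_n = 0.5 / (n + 1)."""
    history = []
    for n in range(steps):
        r = nearest_feasible(rho)
        history.append((n, [round(x, 3) for x in rho], r, cost(r)))
        eta = 0.5 / (n + 1)
        rho = [x - eta * gi for x, gi in zip(rho, gradient(rho))]
        # Assumed projection: shift back onto the hyperplane sum(rho) = K.
        shift = (sum(rho) - K) / len(rho)
        rho = [x - shift for x in rho]
    return history


if __name__ == "__main__":
    for row in run([1.8, 9.1, 6.2, 2.9], 3):
        print(row)
```

In this example the gradient components sum to zero at each step, so the iterate stays on the hyperplane and the assumed projection leaves it unchanged up to rounding.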

Example 2: Consider a manufacturing system formed by five stages in series. The arrival process to the system is Poisson with rate $\lambda = 1.0$ and the service processes are all exponential with rates $\mu_1 = 2.0$, $\mu_2 = 1.5$, $\mu_3 = 1.3$, $\mu_4 = 1.2$, and $\mu_5 = 1.1$. Note that the Poisson arrival process and exponential service times are not required by the algorithm; they are chosen to simplify the simulations.

We would like to allocate kanban (tickets) to stages 2 through 5 so as to minimize a cost function that has two components

$$J(r) = J_1(r) + J_2(r)$$

where $r \in \mathbb{Z}_+^4$ is the vector of kanban allocated to stages 2 through 5. The first component $J_1(r)$ is the average system time for jobs and the second component $J_2(r)$ is a cost on the number of kanban allocated, defined as


$$J_2(r) = c \sum_{i=1}^{4} r_i$$


For large enough $c$, the second component $J_2(r)$ dominates the cost; therefore, a capacity constraint of the form

$$\sum_{i=1}^{4} r_i = K$$

is enforced. The problem, then, can be written as

$$\min_{\sum_{i=1}^{4} r_i = K} J_1(r)$$

which was considered in [67] with $K = 13$. The surrogate method performs as follows on the same problem.

At each iteration we observe 100 departures and use a decreasing step size $\eta_n$.

The optimal allocation is observed to be [1, 3, 4, 5], which matches the result from [67]. It is worth noting that this optimal point is identified within 13 iterations, illustrating the convergence speed of this method.
| Iterations | $r$ | $J(r)$ |
|---|---|---|
| 1 | [3, 3, 3, 4] | 0.798133 |
| 2 | [1, 2, 2, 8] | 0.781896 |
| 3 | [1, 5, 4, 8] | 0.767171 |
| 4 | [1, 4, 6, 7] | 0.746568 |
| 5 | [1, 4, 6, 7] | 0.761161 |
| 6 | [1, 4, 6, 6] | 0.709394 |
| 7 | [1, 3, 5, 6] | 0.827928 |
| 8 | [1, 3, 5, 6] | 0.788815 |
| 9 | [1, 3, 5, 5] | 0.730709 |
| 10 | [1, 3, 5, 6] | 0.742748 |
| 11 | [1, 3, 5, 5] | 0.791522 |
| 12 | [1, 2, 5, 5] | 0.865436 |
| 13 | [1, 3, 4, 5] | 0.795680 |
| 14 | [1, 3, 4, 5] | 0.738700 |
| 15 | [1, 3, 4, 5] | 0.857133 |
| 16 | [1, 3, 4, 5] | 0.679464 |
| 17 | [1, 3, 4, 5] | 0.875472 |
| 18 | [1, 3, 4, 5] | 0.840447 |

Table D.2: Optimal Kanban Allocation

D.6.1  Multicommodity  Resource  Allocation  Problems 

An interesting class of discrete optimization problems arises when $Q$ different types of resources must be allocated to $N$ users. The corresponding optimization problem we would like to solve is

$$\min_{r \in A_d} J(r)$$

where $r = [r_{1,1}, \ldots, r_{1,Q}, \ldots, r_{N,1}, \ldots, r_{N,Q}]$ is the allocation vector and $r_{i,q}$ is the number of resources of type $q$ allocated to user $i$. A typical feasible set $A_d$ is defined by the capacity constraints

$$\sum_{i=1}^{N} r_{i,q} \le K_q, \quad q = 1, \ldots, Q$$

and possibly additional constraints such as $\ldots \le \gamma_i$ for $i = 1, \ldots, N$. Aside from the fact that such problems are of higher dimensionality because of the $Q$ different resource types that must be allocated to each user, it is also common that they exhibit multiple local minima. Examples of such problems are encountered in operations planning that involves $N$ tasks to be performed simultaneously, each task $i$ requiring a “package” of resources $(r_{i,1}, \ldots, r_{i,Q})$ in order to be carried out. The natural trade-off involved is between carrying out fewer tasks, each with a high probability of success (because each task is provided adequate resources), and carrying out more tasks, each with a lower probability of success.

The “surrogate problem” method provides an attractive means of dealing with these problems with local minima because of its convergence speed. Our approach for solving these problems is to randomize over the initial states $r_0$ (equivalently, $\rho_0$) and seek a (possibly local) minimum corresponding to each initial point. The process is repeated for different, randomly selected initial states so as to seek better solutions. For deterministic problems, the best allocation seen so far is reported as the optimal. For stochastic problems, we adopt the stochastic comparison approach in [36]: the algorithm is run from a randomly selected initial point, and the cost of the corresponding final point is compared with the cost of the “best point seen so far”. The stochastic comparison test in [36] is then applied to determine the “best point seen so far” for the next run. Therefore, the surrogate problem method can be seen as a complementary component for random search algorithms that exploits the problem structure to yield better generating probabilities (as discussed in [36]). This eliminates (or reduces) visits to poor allocations, enabling such algorithms to be applied on-line.
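The restart procedure for the deterministic case can be sketched as follows. The sketch covers only the "keep the best allocation seen so far" rule; the stochastic comparison test of [36] is not implemented. The names `multistart`, `toy_cost`, and `greedy_descent` are hypothetical, and the one-dimensional toy cost with two local minima merely stands in for a surrogate-method run.

```python
import random


def multistart(local_search, sample_initial, cost, runs, seed=0):
    """Run local_search from `runs` random initial points and keep the best
    final point (deterministic costs only)."""
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(runs):
        x = local_search(sample_initial(rng))
        c = cost(x)
        if c < best_cost:
            best, best_cost = x, c
    return best, best_cost


def toy_cost(x):
    """Toy integer cost with a local minimum at 4 and a global minimum at 15."""
    return min((x - 4) ** 2, (x - 15) ** 2 - 5)


def greedy_descent(x, lo=0, hi=20):
    """Stand-in local optimizer: move to the better neighbor until none improves."""
    while True:
        candidates = [y for y in (x - 1, x + 1) if lo <= y <= hi]
        y = min(candidates, key=toy_cost)
        if toy_cost(y) >= toy_cost(x):
            return x
        x = y


if __name__ == "__main__":
    print(multistart(greedy_descent, lambda rng: rng.randint(0, 20), toy_cost, runs=20))
```

Each individual run ends at one of the two local minima, depending on the basin of the initial point; repeating from random starts is what allows the better of the two to be retained.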

In what follows we consider a problem with $N = 16$, $Q = 2$, and $K_1 = 20$, $K_2 = 8$. We then seek a 32-dimensional vector $r = [r_{1,1}, r_{1,2}, \ldots, r_{16,1}, r_{16,2}]$ to maximize a reward function of the form

$$J(r) = \sum_{i=1}^{16} J_i(r) \tag{D.40}$$

subject to

$$\sum_{i=1}^{N} r_{i,1} \le 20, \qquad \sum_{i=1}^{N} r_{i,2} \le 8$$

The reward functions $J_i(r)$ we will use in this problem are defined as

$$J_i(r) = V_i P_i^0(r) - C_1 r_{i,1} P_i^1(r) - C_2 r_{i,2} P_i^2(r) \tag{D.41}$$

In (D.41), $V_i$ represents the “value” of successfully completing the $i$th task and $P_i^0(r)$ is the probability of successful completion of the $i$th task under an allocation $r$. In addition, $C_q$ is the cost of a resource of type $q$, where $q = 1, 2$, and $P_i^q(r)$ is the probability that a resource of type $q$ is completely consumed or lost during the execution of the $i$th task under an allocation $r$. A representative example of a reward function for a single task with $V_i = 150$ is shown in Fig. D.1.

The cost values of the resource types are $C_1 = 20$ and $C_2 = 40$, and the values for the tasks we will use in this problem range between 50 and 150.
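The structure of the reward in (D.41) and of the two capacity constraints can be sketched in code. The probability models are not specified in this section, so they are passed in as callables; as a simplifying assumption they depend only on the task's own package $(r_{i,1}, r_{i,2})$ and are shared by all tasks. All function names are illustrative, and the constant probabilities in the usage lines are placeholders, not values from the report.

```python
C1, C2 = 20, 40  # resource costs given in the text
K1, K2 = 20, 8   # capacities of the two resource types


def task_reward(V, r1, r2, p_success, p_loss1, p_loss2):
    """Reward of one task: value times success probability, minus the cost
    of each resource type weighted by its probability of being lost."""
    return (V * p_success(r1, r2)
            - C1 * r1 * p_loss1(r1, r2)
            - C2 * r2 * p_loss2(r1, r2))


def total_reward(values, allocation, p_success, p_loss1, p_loss2):
    """Sum of task rewards; allocation is a list of (r1, r2) pairs, one per task."""
    return sum(task_reward(V, r1, r2, p_success, p_loss1, p_loss2)
               for V, (r1, r2) in zip(values, allocation))


def is_feasible(allocation):
    """Check both capacity constraints."""
    return (sum(r1 for r1, _ in allocation) <= K1
            and sum(r2 for _, r2 in allocation) <= K2)


if __name__ == "__main__":
    # Placeholder probability models (assumptions for illustration only).
    succ = lambda r1, r2: 0.8
    loss1 = lambda r1, r2: 0.5
    loss2 = lambda r1, r2: 0.25
    print(task_reward(150, 2, 1, succ, loss1, loss2))
    print(is_feasible([(2, 1)] * 8), is_feasible([(2, 1)] * 9))
```

Separating the reward structure from the probability models keeps the sketch usable with whatever task model is adopted; only the callables need to change.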

The  surrogate  method  is  executed  from  random  initial  points  and  the  results  for  some 
runs  are  shown  in  Fig.  D.2.  Note  that  due  to  local  maxima,  some  runs  yield  suboptimal 
results.  However,  in  all  cases  convergence  is  attained  extremely  fast,  enabling  us  to  repeat 
the  optimization  process  multiple  times  with  different  initial  points  in  search  of  the  global 
maximum. Although it is infeasible to identify the actual global maximum, we have compared our approach to a few heuristic techniques and pure random search methods and
found  the  “surrogate  problem”  method  to  outperform  them. 
Figure D.1: A typical reward function $J_i(r_{i,1}, r_{i,2})$

D.7  Conclusions 


In this paper we have generalized the methodology presented in [33] for solving stochastic discrete optimization problems. In particular, we have introduced the concept of a “selection set” associated with every surrogate state $\rho \in A_c$ and modified the definition of the surrogate cost function $L_c(\rho)$ so that the method can be applied to arbitrary constraint sets and is computationally more efficient.

As in [33], the discrete optimization problem was transformed into a “surrogate” continuous optimization problem which was solved using gradient-based techniques. It was
shown  that,  under  certain  conditions,  the  solution  of  the  original  problem  is  recovered  from 
the  optimal  surrogate  state.  A  key  contribution  of  the  methodology  is  its  on-line  control 
nature,  based  on  the  actual  data  from  the  underlying  system.  One  can  therefore  see  that 
this  approach  is  intended  to  combine  the  advantages  of  a  stochastic  approximation  type  of 
algorithm  with  the  ability  to  obtain  sensitivity  estimates  with  respect  to  discrete  decision 
variables.  This  combination  leads  to  very  fast  convergence  to  the  optimal  point. 

Using  this  approach,  we  have  also  tackled  a  class  of  particularly  hard  multicommodity 
discrete  optimization  problems,  where  multiple  local  optima  typically  exist.  Exploiting  the 
convergence  speed  of  the  surrogate  method,  we  presented  a  procedure  where  the  algorithm 
is  started  from  multiple  random  initial  states  in  an  effort  to  determine  the  global  optimum. 
Figure D.2: Algorithm convergence under different initial points